Commit graph

16285 commits

Author SHA1 Message Date
Changli Gao
deffd77759 net: arp: code cleanup
Clean the code up according to Documentation/CodingStyle.

Don't initialize the variable dont_send in arp_process().

Remove the temporary variable flags in arp_state_to_flags().

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-02 10:12:06 -07:00
Eric Dumazet
c07b68e841 net: dev_add_pack() & __dev_remove_pack() changes
Add a small helper ptype_head() to get the head to manipulate

dev_add_pack() & __dev_remove_pack() can use a spinlock without
blocking BH, since softirq use RCU, and these functions are run from
process context only.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-02 10:12:06 -07:00
Julian Anastasov
8ed2163ff3 ipvs: use pkts for SCTP too
Correctly use the in_pkts packet counter for SCTP as well

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-02 10:04:18 -07:00
Eric Dumazet
95f4b45bc6 net: another last_rx round
Kill last_rx use in l2tp and two net drivers

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-02 09:19:32 -07:00
Gerrit Renker
0705c6f0e2 tcp: update also tcp_output with regard to RFC 5681
Thanks to Ilpo Jarvinen, this updates also the initial window
setting for tcp_output with regard to RFC 5681.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-01 18:15:00 -07:00
stephen hemminger
fa50d64576 net: make rx_queue sysfs_ops const
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-01 18:12:20 -07:00
Eric Dumazet
6602cebb5b net: skbuff.c cleanup
(skb->data - skb->head) can be replaced by skb_headroom(skb)

Remove some uses of NET_SKBUFF_DATA_USES_OFFSET, using
(skb_end_pointer(skb) - skb->head) or
(skb_tail_pointer(skb) - skb->head) : compiler does the right thing,
and this is more readable for us ;)

(struct skb_shared_info *) casts in pskb_expand_head() to help memcpy()
to use aligned moves.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-01 10:57:55 -07:00
Eric Dumazet
86cac58b71 skge: add GRO support
- napi_gro_flush() is exported from net/core/dev.c, to avoid
  an irq_save/irq_restore in the packet receive path.
- use napi_gro_receive() instead of netif_receive_skb()
- use napi_gro_flush() before calling __napi_complete()
- turn on NETIF_F_GRO by default
- Tested on a Marvell 88E8001 Gigabit NIC

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-01 10:57:55 -07:00
Eric Dumazet
875168a933 net: tunnels should use rcu_dereference
tunnel4_handlers, tunnel64_handlers, tunnel6_handlers and
tunnel46_handlers are protected by RCU, but we don't use appropriate RCU
primitives to scan them. rcu_read_lock() is already held by caller.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-01 10:57:53 -07:00
Eric Dumazet
1639ab6f78 gro: unexport tcp4_gro_receive and tcp4_gro_complete
tcp4_gro_receive() and tcp4_gro_complete() don't need to be exported.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-31 13:37:07 -07:00
Eric Dumazet
ba4fd9d828 pktgen: remove non used variable
Remove unused variable "queue" in pg_cleanup

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-31 13:37:06 -07:00
Jiri Pirko
72ed62f7c9 vlan: Use vlan_dev_real_dev in vlan_hwaccel_do_receive
Use helper as in other places.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-31 13:37:05 -07:00
Rémi Denis-Courmont
01b38606bd Phonet: do not set POLLOUT in case of send buffer overflow
Signed-off-by: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-31 13:04:33 -07:00
Rémi Denis-Courmont
02ac3268a5 Phonet: correct sendmsg() error code from sock_alloc_send_skb()
Signed-off-by: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-31 13:04:32 -07:00
Rémi Denis-Courmont
1a98214fee Phonet: restore flow control credits when sending fails
Signed-off-by: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-31 13:04:32 -07:00
Eric Dumazet
3ff2cfa55f ipv6: struct xfrm6_tunnel in read_mostly section
The tunnel6_handlers chain is scanned for each incoming packet;
make sure it doesn't share an often dirtied cache line.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-30 13:50:46 -07:00
Eric Dumazet
6dcd814bd0 net: struct xfrm_tunnel in read_mostly section
The tunnel4_handlers chain is scanned for each incoming packet;
make sure it doesn't share an often dirtied cache line.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-30 13:50:45 -07:00
Gerrit Renker
89858ad143 dccp ccid-3: use per-route RTO or TCP RTO as fallback
This makes RTAX_RTO_MIN also available to CCID-3, replacing the compile-time
RTO lower bound with a per-route tunable value.

The original Kconfig option solved the problem that a very low RTT (in the
order of HZ) can trigger too frequent and unnecessary reductions of the
sending rate.

This tunable does not affect the initial RTO value of 2 seconds specified in
RFC 5348, section 4.2 and Appendix B. But like the hardcoded Kconfig value,
it allows adapting to network conditions.

The same effect as the original Kconfig option of 100ms is now achieved by

> ip route replace to unicast 192.168.0.0/24 rto_min 100j dev eth0

(assuming HZ=1000).

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-30 13:45:28 -07:00
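A rough check of the `100j` value in the commit above, assuming the `j` suffix passes a raw jiffies count (which is what the "(assuming HZ=1000)" remark implies). The helper name jiffies_to_ms is illustrative only; it is not a kernel or iproute2 function.

```python
def jiffies_to_ms(jiffies: int, hz: int) -> float:
    """Convert a jiffies count to milliseconds for a given tick rate HZ."""
    if hz <= 0:
        raise ValueError("hz must be positive")
    return jiffies * 1000 / hz


# 100 jiffies at HZ=1000 corresponds to the old 100 ms Kconfig bound.
print(jiffies_to_ms(100, 1000))  # 100.0
# The same count at HZ=250 would be 400 ms, hence the HZ caveat.
print(jiffies_to_ms(100, 250))   # 400.0
```

The second call shows why the commit message qualifies the example with a specific HZ: a raw jiffies value means a different duration on a kernel built with another tick rate.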
Gerrit Renker
4886fcad6e dccp ccid-2: Share TCP's minimum RTO code
Using a fixed RTO_MIN of 0.2 seconds was found to cause problems for CCID-2
over 802.11g: at least once per session there was a spurious timeout. It
helped to then increase the value of RTO_MIN over this link.

Since the problem is the same as in TCP, this patch makes the solution from
commit "05bb1fad1cde025a864a90cfeb98dcbefe78a44a"
       "[TCP]: Allow minimum RTO to be configurable via routing metrics."
available to DCCP.

This avoids reinventing the wheel, so that e.g. the following works in the
expected way now also for CCID-2:

> ip route change 10.0.0.2 rto_min 800 dev ath0

Luckily this useful rto_min function was recently moved to net/tcp.h,
which simplifies sharing code originating from TCP.

Documentation also updated (plus minor whitespace fixes).

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-30 13:45:27 -07:00
Gerrit Renker
22b71c8f4f tcp/dccp: Consolidate common code for RFC 3390 conversion
This patch consolidates initial-window code common to TCP and CCID-2:
 * TCP uses RFC 3390 in a packet-oriented manner (tcp_input.c) and
 * CCID-2 uses RFC 3390 in a packet-oriented manner (RFC 4341).

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-30 13:45:26 -07:00
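For orientation, the RFC 3390 initial-window bound that the commit above refers to can be sketched as below. This is a minimal illustration of the RFC formula min(4*MSS, max(2*MSS, 4380 bytes)), not the kernel helper itself; the function names are hypothetical, and the packet count simply rounds down to whole segments, which may differ from how the kernel rounds.

```python
def rfc3390_initial_window_bytes(mss: int) -> int:
    """Upper bound on the initial window in bytes, per the RFC 3390 formula."""
    if mss <= 0:
        raise ValueError("mss must be positive")
    return min(4 * mss, max(2 * mss, 4380))


def rfc3390_initial_window_packets(mss: int) -> int:
    """Whole MSS-sized packets that fit into the byte bound (rounded down)."""
    return rfc3390_initial_window_bytes(mss) // mss


for mss in (536, 1460, 4000):
    print(mss,
          rfc3390_initial_window_bytes(mss),
          rfc3390_initial_window_packets(mss))
```

Small segments get four packets, typical Ethernet-sized segments get three, and large segments get two, which is the packet-oriented reading both TCP and CCID-2 share.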
Gerrit Renker
d26eeb07fd dccp ccid-2: Remove wrappers around sk_{reset,stop}_timer()
This removes the wrappers around the sk timer functions, since not much is
gained from using them: the BUG_ON in start_rto_timer will never trigger
since that function is called only if:

 * the RTO timer expires (rto_expire, and then timer_pending() is false);
 * in tx_packet_sent only if !timer_pending() (BUG_ON is redundant here);
 * previously in new_ack, after stopping the timer (timer_pending() false).

Removing the wrappers also clears the way for eventually replacing the
RTO timer with the icsk-retransmission-timer, as it is already part of the
DCCP socket.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-30 13:45:26 -07:00
Gerrit Renker
d82b6f85c1 dccp ccid-2: Use u32 timestamps uniformly
Since CCID-2 is de facto a mini implementation of TCP, it makes sense to share
as much code as possible.

Hence this patch aligns CCID-2 timestamping with TCP timestamping.
This also halves the space consumption (on 64-bit systems).

The necessary include file <net/tcp.h> is already included by way of
net/dccp.h. Redundant includes have been removed.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-30 13:45:25 -07:00
Jerry Chu
dca43c75e7 tcp: Add TCP_USER_TIMEOUT socket option.
This patch provides "user timeout" support as described in RFC793. The
socket option is also needed for the local half of RFC5482 "TCP User
Timeout Option".

TCP_USER_TIMEOUT is a TCP level socket option that takes an unsigned int,
when > 0, to specify the maximum amount of time in ms that transmitted
data may remain unacknowledged before TCP will forcefully close the
corresponding connection and return ETIMEDOUT to the application. If
0 is given, TCP will continue to use the system default.

Increasing the user timeouts allows a TCP connection to survive extended
periods without end-to-end connectivity. Decreasing the user timeouts
allows applications to "fail fast" if so desired. Otherwise it may take
up to 20 minutes with the current system defaults in a normal WAN
environment.

The socket option can be set during any state of a TCP connection, but
is only effective during the synchronized states of a connection
(ESTABLISHED, FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, or LAST-ACK).
Moreover, when used with the TCP keepalive (SO_KEEPALIVE) option,
TCP_USER_TIMEOUT will override keepalive to determine when to close a
connection due to keepalive failure.

The option does not change in any way when TCP retransmits a packet, nor
when a keepalive probe will be sent.

This option, like many others, will be inherited by an acceptor from its
listener.

Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-30 13:23:33 -07:00
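From user space, the option described in the commit above might be used roughly as follows. This is a sketch under assumptions: it targets Linux only, the helper names are made up for illustration, and the fallback value 18 is the Linux definition of TCP_USER_TIMEOUT for Python builds that do not export the constant.

```python
import socket

# Assumption: Linux. Use the exported constant when available, otherwise
# fall back to the Linux value of TCP_USER_TIMEOUT (18).
TCP_USER_TIMEOUT = getattr(socket, "TCP_USER_TIMEOUT", 18)


def set_user_timeout(sock: socket.socket, timeout_ms: int) -> None:
    """Set the maximum time (ms) sent data may stay unacknowledged.

    A value of 0 leaves the system default in place.
    """
    if timeout_ms < 0:
        raise ValueError("timeout_ms must be >= 0")
    sock.setsockopt(socket.IPPROTO_TCP, TCP_USER_TIMEOUT, timeout_ms)


def get_user_timeout(sock: socket.socket) -> int:
    """Read back the current user timeout in milliseconds."""
    return sock.getsockopt(socket.IPPROTO_TCP, TCP_USER_TIMEOUT)
```

An application that wants to "fail fast" could call set_user_timeout(sock, 30000) right after creating the socket; as the commit message notes, the value only takes effect once the connection is in a synchronized state.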
Stephen Rothwell
2c70b51962 IPVS: include net/ip6_checksum.h for csum_ipv6_magic
Fixes this build error:

net/netfilter/ipvs/ip_vs_core.c: In function 'ip_vs_nat_icmp_v6':
net/netfilter/ipvs/ip_vs_core.c:640: error: implicit declaration of function 'csum_ipv6_magic'

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-29 21:15:26 -07:00
Akinobu Mita
6a499b242f phonet: use for_each_set_bit
Replace open-coded loop with for_each_set_bit().

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-28 15:37:04 -07:00
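The iteration pattern named in the commit above can be modelled in a few lines. This is only a model of the visiting order (ascending indices of set bits below a size limit); the kernel macro operates on arrays of unsigned long, and the Python function here merely borrows its name for illustration.

```python
def for_each_set_bit(bitmap: int, size: int):
    """Yield the indices of set bits below `size`, lowest index first."""
    for bit in range(size):
        if bitmap & (1 << bit):
            yield bit


# Bits 0, 3 and 5 are set in 0b101001.
print(list(for_each_set_bit(0b101001, 8)))  # [0, 3, 5]
```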
Akinobu Mita
762c29164e econet: kill unnecessary spin_lock_init()
The spinlock aun_queue_lock is initialized statically. It is unnecessary
to initialize by spin_lock_init() at module load time.

This is detected by the semantic patch.

// <smpl>
@def@
declarer name DEFINE_SPINLOCK;
identifier spinlock;
@@

DEFINE_SPINLOCK(spinlock);

@@
identifier def.spinlock;
@@

- spin_lock_init(&spinlock);
// </smpl>

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Julia Lawall <julia@diku.dk>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-28 15:37:03 -07:00
Eric Dumazet
40d0802b3e gro: __napi_gro_receive() optimizations
compare_ether_header() can have a special implementation on 64 bit
arches if CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is defined.

__napi_gro_receive() and vlan_gro_common() can avoid a conditional
branch to perform device match.

On x86_64, __napi_gro_receive() now has 38 instructions instead of 53.

As gcc-4.4.3 still chooses not to inline it, add inline keyword to this
performance critical function.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-26 22:03:08 -07:00
Changli Gao
53f91dc1f7 net: use scnprintf() to avoid potential buffer overflow
strlcpy() returns the total length of the string it tried to create, so
we should not use its return value without any check. scnprintf() returns
the number of characters written into @buf not including the trailing '\0',
so use it instead here.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-26 14:11:49 -07:00
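The difference in return values described in the commit above can be modelled as follows. This is a simplified sketch that looks only at the returned lengths for a plain string source (no format expansion); the function names are illustrative stand-ins, not bindings to the kernel routines.

```python
def strlcpy_result(src: bytes, size: int) -> int:
    """Model of strlcpy()'s return value: the full source length,
    even when the copy was truncated to fit `size` bytes."""
    return len(src)


def scnprintf_result(src: bytes, size: int) -> int:
    """Model of scnprintf()'s return value: bytes actually stored,
    not counting the trailing NUL."""
    if size == 0:
        return 0
    return min(len(src), size - 1)


BUF_SIZE = 8
name = b"verylongname"  # 12 bytes, does not fit into 8

# Using the strlcpy-style result as a write offset points past the buffer.
print(strlcpy_result(name, BUF_SIZE))    # 12
# The scnprintf-style result always stays inside the buffer.
print(scnprintf_result(name, BUF_SIZE))  # 7
```

Code that advances a write position by the first value without checking it can run past the end of the buffer; the second value cannot exceed size - 1.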
Joe Perches
145ce502e4 net/sctp: Use pr_fmt and pr_<level>
Change SCTP_DEBUG_PRINTK and SCTP_DEBUG_PRINTK_IPADDR to
use do { print } while (0) guards.
Add SCTP_DEBUG_PRINTK_CONT to fix errors in log when
lines were continued.
Add #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
Add a missing newline in "Failed bind hash alloc"

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-26 14:11:48 -07:00
Simon Horman
dee06e4702 ipvs: switch to GFP_KERNEL allocations
Switch from GFP_ATOMIC allocations to GFP_KERNEL ones in
ip_vs_add_service() and ip_vs_new_dest(), as we hold a mutex and are
allowed to sleep in this context.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-26 13:21:29 -07:00
Simon Horman
4f72816ef0 IPVS: convert __ip_vs_securetcp_lock to a spinlock
Also rename __ip_vs_securetcp_lock to ip_vs_securetcp_lock.

Spinlock conversion was suggested by Eric Dumazet.

Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-26 13:21:29 -07:00
Simon Horman
bd14455048 IPVS: convert __ip_vs_sched_lock to a spinlock
Also rename __ip_vs_sched_lock to ip_vs_sched_lock.

Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-26 13:21:28 -07:00
Simon Horman
8870f8427b IPVS: ICMPv6 checksum calculation
Cc: Xiaoyu Du <tingsrain@gmail.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-26 13:21:26 -07:00
stephen hemminger
aa7c6e5fa0 bridge: avoid ethtool on non running interface
If bridge port is offline, don't call ethtool to query speed.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-25 16:36:51 -07:00
Stephen Hemminger
944c794d64 bridge: fix locking comment
The carrier check is not called from work queue in current code.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-25 16:36:50 -07:00
Julia Lawall
b2aff96327 net/netfilter/ipvs: Eliminate memory leak
__ip_vs_service_get and __ip_vs_svc_fwm_get increment a reference count, so
that reference count should be decremented before leaving the function in an
error case.

A simplified version of the semantic match that finds this problem is:
(http://coccinelle.lip6.fr/)

// <smpl>
@r exists@
local idexpression x;
expression E;
identifier f1;
iterator I;
@@

x = __ip_vs_service_get(...);
<... when != x
     when != true (x == NULL || ...)
     when != if (...) { <+...x...+> }
     when != I (...) { <+...x...+> }
(
 x == NULL
|
 x == E
|
 x->f1
)
...>
* return ...;
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-25 16:36:50 -07:00
John W. Linville
e569aa78ba Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next-2.6 into for-davem
Conflicts:
	drivers/net/wireless/libertas/if_sdio.c
2010-08-25 14:51:42 -04:00
Stephen Hemminger
c2e3143e3c tc: add meta match on receive hash
Trivial extension to existing meta data match rules to allow
matching on skb receive hash value.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-24 14:48:10 -07:00
Eric Dumazet
ec550d246e net: ip_append_data() optim
Compiler is not smart enough to avoid a conditional branch.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-24 14:45:09 -07:00
Johannes Berg
2e161f78e5 cfg80211/mac80211: extensible frame processing
Allow userspace to register for more than just
action frames by giving the frame subtype, and
make it possible to use this in various modes
as well.

With some tweaks and some added functionality
this will, in the future, also be usable in AP
mode and be able to replace the cooked monitor
interface currently used in that case.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2010-08-24 16:27:56 -04:00
Johannes Berg
ac4c977d16 mac80211: remove unused don't-encrypt flag
When MFP is disabled, action frames will not
be encrypted since they are management frames
and the only management frames that can then
be encrypted are authentication frames.

Therefore, setting the don't-encrypt flag on
action frames is unnecessary.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2010-08-24 16:27:55 -04:00
Johannes Berg
633adf1ad1 cfg80211: mark ieee80211_hdrlen const
This function analyses only its single, value-passed
argument, and has no side effects. Thus it can be
const, which makes mac80211 smaller, for example:

   text	   data	    bss	    dec	    hex	filename
 362518	  16720	    884	 380122	  5ccda	mac80211.ko (before)
 362358	  16720	    884	 379962	  5cc3a	mac80211.ko (after)

a 160 byte saving in text size, and an optimisation
because the function won't be called as often.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2010-08-24 16:27:54 -04:00
stephen hemminger
0fdc100bdc ethtool: allow non-netadmin to query settings
The SNMP daemon uses ethtool to determine the speed of
network interfaces. This fails on Debian (and probably elsewhere)
because for security SNMP daemon runs as non-root user (snmp).

Note: A similar patch was rejected previously because of a concern about
the possibility that on some hardware querying the ethtool settings
requires access to the PHY and could slow the machine down.  But the
security risk of requiring SNMP daemon (and related services)
to run as root far outweighs the risk of denial-of-service.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-23 20:43:16 -07:00
Eric Dumazet
afdcba371f net: copy_rtnl_link_stats64() simplification
No need to use a temporary struct rtnl_link_stats64 variable,
just copy the source to skb buffer.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Reviewed-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-23 20:43:16 -07:00
Changli Gao
0eec32ff35 net_sched: act_csum: coding style cleanup
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-23 20:43:15 -07:00
David S. Miller
7abac68602 pkt_sched: Make act_csum depend upon INET.
It uses ip_send_check() and stuff like that.

Reported-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-23 20:42:11 -07:00
Gerrit Renker
231cc2aaf1 dccp ccid-2: Replace broken RTT estimator with better algorithm
The current CCID-2 RTT estimator code is partly broken and lags behind the
suggestions in RFC2988 of using scaled variants for SRTT/RTTVAR.

That code is replaced by the present patch, which reuses the Linux TCP RTT
estimator code.

Further details:
----------------
 1. The minimum RTO of previously one second has been replaced with TCP's, since
    RFC4341, sec. 5 says that the minimum of 1 sec. (suggested in RFC2988, 2.4)
    is not necessary. Instead, the TCP_RTO_MIN is used, which agrees with DCCP's
    concept of a default RTT (RFC 4340, 3.4).
 2. The maximum RTO has been set to DCCP_RTO_MAX (64 sec), which agrees with
    RFC2988, (2.5).
 3. De-inlined the function ccid2_new_ack().
 4. Added a FIXME: the RTT is sampled several times per Ack Vector, which will
    give the wrong estimate. It should be replaced with one sample per Ack.
    However, at the moment this can not be resolved easily, since
    - it depends on TX history code (which also needs some work),
    - the cleanest solution is not to use the `sent' time at all (saves 4 bytes
      per entry) and use DCCP timestamps / elapsed time to estimate the RTT,
      which however is non-trivial to get right (but needs to be done).

Reasons for reusing the Linux TCP estimator algorithm:
------------------------------------------------------
Some time was spent to find a better alternative, using basic RFC2988 as a first
step. Further analysis and experimentation showed that the Linux TCP RTO
estimator is superior to a basic RFC2988 implementation. A summary is on
http://www.erg.abdn.ac.uk/users/gerrit/dccp/notes/ccid2/rto_estimator/

In addition, this estimator fared well in a recent empirical evaluation:

    Rewaskar, Sushant, Jasleen Kaur and F. Donelson Smith.
    A Performance Study of Loss Detection/Recovery in Real-world TCP
    Implementations. Proceedings of 15th IEEE International
    Conference on Network Protocols (ICNP-07), 2007.

Thus there is significant benefit in reusing the existing TCP code.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-23 20:13:31 -07:00
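As background for the commit above, the basic RFC 2988 computation it mentions as a "first step" can be sketched like this. Note that this is the plain, unscaled RFC 2988 algorithm, not the Linux TCP estimator that the patch actually reuses. The class name and the default clock granularity are assumptions for illustration; the 0.2 s lower bound is assumed to correspond to TCP_RTO_MIN, and the 64 s upper bound follows the DCCP_RTO_MAX value quoted in the message.

```python
class Rfc2988Estimator:
    """Basic RFC 2988 RTO computation using floating-point seconds."""

    ALPHA = 1 / 8   # gain for SRTT
    BETA = 1 / 4    # gain for RTTVAR
    K = 4           # variance multiplier

    def __init__(self, rto_min=0.2, rto_max=64.0, granularity=0.001):
        self.rto_min = rto_min          # assumed TCP_RTO_MIN (seconds)
        self.rto_max = rto_max          # DCCP_RTO_MAX per the commit
        self.granularity = granularity  # assumed clock granularity G
        self.srtt = None
        self.rttvar = None
        self.rto = 3.0                  # RFC 2988 initial RTO

    def sample(self, rtt: float) -> float:
        """Feed one RTT measurement (seconds) and return the new RTO."""
        if self.srtt is None:
            # First measurement: SRTT = R, RTTVAR = R / 2.
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            # RTTVAR is updated first, using the previous SRTT.
            self.rttvar = ((1 - self.BETA) * self.rttvar
                           + self.BETA * abs(self.srtt - rtt))
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt
        rto = self.srtt + max(self.granularity, self.K * self.rttvar)
        self.rto = min(self.rto_max, max(self.rto_min, rto))
        return self.rto
```

Feeding a steady 100 ms RTT makes the RTO shrink from about 0.3 s toward the lower bound as RTTVAR decays, while a very low RTT is held at the minimum, which is the situation the per-route rto_min tunable elsewhere in this series addresses.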
Gerrit Renker
c38c92a84a dccp ccid-2: Simplify dec_pipe and rearming of RTO timer
This removes the dec_pipe function and improves the way the RTO timer is rearmed
when a new acknowledgment comes in.

Details and justification for removal:
--------------------------------------
 1) The BUG_ON in dec_pipe is never triggered: pipe is only decremented for TX
    history entries between tail and head, for which it had previously been
    incremented in tx_packet_sent; and it is not decremented twice for the same
    entry, since it is
    - either decremented when a corresponding Ack Vector cell in state 0 or 1
      was received (and then ccid2s_acked==1),
    - or it is decremented when ccid2s_acked==0, as part of the loss detection
      in tx_packet_recv (and hence it can not have been decremented earlier).

 2) Restarting the RTO timer happens for every single entry in each Ack Vector
    parsed by tx_packet_recv (according to RFC 4340, 11.4 this can happen up to
    16192 times per Ack Vector).

 3) The RTO timer should not be restarted when all outstanding data has been
    acknowledged. This is currently done similarly to (2), in dec_pipe, when
    pipe has reached 0.

The patch consolidates the code which rearms the RTO timer, combining the
segments from new_ack and dec_pipe. As a result, the code becomes clearer
(compare with tcp_rearm_rto()).

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-23 20:13:31 -07:00
Gerrit Renker
30564e3555 dccp ccid-2: Remove redundant sanity tests
This removes the ccid2_hc_tx_check_sanity function: it is redundant.

Details:

The tx_check_sanity function performs three tests:
 1) it checks that the circular TX list is sorted
    - in ascending order of sequence number (ccid2s_seq)
    - and time (ccid2s_sent),
    - in the direction from `tail' (hctx_seqt) to `head' (hctx_seqh);
 2) it ensures that the entire list has the length seqbufc * CCID2_SEQBUF_LEN;
 3) it ensures that pipe equals the number of packets that were not
    marked `acked' (ccid2s_acked) between `tail' and `head'.

The following argues that each of these tests is redundant; this can be verified
by going through the code.

(1) is not necessary, since both time and GSS increase from one packet to the
next, so that subsequent insertions in tx_packet_sent (which advance the `head'
pointer) will be in ascending order of time and sequence number.

In (2), the length of the list is always equal to seqbufc times CCID2_SEQBUF_LEN
(set to 1024) unless allocation caused an earlier failure, because:
 * at initialisation (tx_init), there is one chunk of size 1024 and seqbufc=1;
 * subsequent calls to tx_alloc_seq take place whenever head->next == tail in
   tx_packet_sent; then a new chunk of size 1024 is inserted between head and
   tail, and seqbufc is incremented by one.

To show that (3) is redundant requires looking at two cases.

The `pipe' variable of the TX socket is incremented only in tx_packet_sent, and
decremented in tx_packet_recv.  When head == tail (TX history empty) then pipe
should be 0, which is the case directly after initialisation and after a
retransmission timeout has occurred (ccid2_hc_tx_rto_expire).

The first case involves parsing Ack Vectors for packets recorded in the live
portion of the buffer, between tail and head. For each packet marked by the
receiver as received (state 0) or ECN-marked (state 1), pipe is decremented by
one, so for all such packets the BUG_ON in tx_check_sanity will not trigger.

The second case is the loss detection in the second half of tx_packet_recv,
below the comment "Check for NUMDUPACK".

The first while-loop here ensures that the sequence number of `seqp' is either
above or equal to `high_ack', or otherwise equal to the highest sequence number
sent so far (of the entry head->prev, as head points to the next unsent entry).
The next while-loop ("while (1)") counts the number of acked packets starting
from that position of seqp, going backwards in the direction from head->prev to
tail. If NUMDUPACK=3 such packets were counted within this loop, `seqp' points
to the last acknowledged packet of these, and the "if (done == NUMDUPACK)" block
is entered next.
The while-loop contained within that block in turn traverses the list backwards,
from head to tail; the position of `seqp' is saved in the variable `last_acked'.
For each packet not marked as `acked', a congestion event is triggered within
the loop, and pipe is decremented. The loop terminates when `seqp' has reached
`tail', whereupon tail is set to the position previously stored in `last_acked'.
Thus, between `last_acked' and the previous position of `tail',
 - pipe has been decremented earlier if the packet was marked as state 0 or 1;
 - pipe was decremented if the packet was not marked as acked.
That is, pipe has been decremented by the number of packets between `last_acked'
and the previous position of `tail'. As a consequence, pipe now again reflects
the number of packets which have not (yet) been acked between the new position
of tail (at `last_acked') and head->prev, or 0 if head==tail. The result is that
the BUG_ON condition in check_sanity will also not be triggered, hence the test
(3) is also redundant.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-23 20:13:30 -07:00
Gerrit Renker
51c22bb510 dccp ccid-3: No more CCID control blocks in LISTEN state
The CCIDs are activated as the last of the features, at the end of the handshake,
where the LISTEN state of the master socket is inherited into the server
state of the child socket. Thus, the only states visible to CCIDs now are
OPEN/PARTOPEN, and the closing states.

This allows removing tests which were previously necessary to protect
against referencing a socket in the listening state (in CCID-3), but which
now have become redundant.

As a further byproduct of enabling the CCIDs only after the connection has been
fully established, several typecast-initialisations of ccid3_hc_{rx,tx}_sock
can now be eliminated:
 * the CCID is loaded, so it is not necessary to test if it is NULL,
 * if it is possible to load a CCID and leave the private area NULL, then this
    is a bug, which should crash loudly - and earlier,
 * the test for state==OPEN || state==PARTOPEN now reduces only to the closing
   phase (e.g. when the node has received an unexpected Reset).

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-23 20:13:30 -07:00