Commit graph

701 commits

Author SHA1 Message Date
Angelo P. Castellani
f890f92104 [TCP]: TCP Compound congestion control
TCP Compound is a sender-side only change to TCP that uses
a mixed Reno/Vegas approach to calculate the cwnd.

For further details look here:
  ftp://ftp.research.microsoft.com/pub/tr/TR-2005-86.pdf

Signed-off-by: Angelo P. Castellani <angelo.castellani@gmail.com>
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:29:25 -07:00
Bin Zhou
76f1017757 [TCP]: TCP Veno congestion control
TCP Veno module is a new congestion control module to improve TCP
performance over wireless networks. The key innovation in TCP Veno is
the enhancement of TCP Reno/Sack congestion control algorithm by using
the estimated state of a connection based on TCP Vegas. This scheme
significantly reduces "blind" reduction of TCP window regardless of
the cause of packet loss.

This work is based on the research paper "TCP Veno: TCP Enhancement
for Transmission over Wireless Access Networks." C. P. Fu, S. C. Liew,
IEEE Journal on Selected Areas in Communication, Feb. 2003.

Original paper and many latest research works on veno:
 http://www.ntu.edu.sg/home/ascpfu/veno/veno.html

Signed-off-by: Bin Zhou <zhou0022@ntu.edu.sg>
	       Cheng Peng Fu <ascpfu@ntu.edu.sg>
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:29:23 -07:00
Wong Hoi Sing Edison
7c106d7e78 [TCP]: TCP Low Priority congestion control
TCP Low Priority is a distributed algorithm whose goal is to utilize only
 the excess network bandwidth as compared to the ``fair share`` of
 bandwidth as targeted by TCP. Available from:
   http://www.ece.rice.edu/~akuzma/Doc/akuzma/TCP-LP.pdf

Original Author:
 Aleksandar Kuzmanovic <akuzma@northwestern.edu>

See http://www-ece.rice.edu/networks/TCP-LP/ for their implementation.
As of 2.6.13, Linux supports pluggable congestion control algorithms.
Due to the limitation of the API, we take the following changes from
the original TCP-LP implementation:
 o We use newReno in most core CA handling. Only add some checking
   within cong_avoid.
 o Error correcting in remote HZ, therefore remote HZ will be keeped
   on checking and updating.
 o Handling calculation of One-Way-Delay (OWD) within rtt_sample, sicne
   OWD have a similar meaning as RTT. Also correct the buggy formular.
 o Handle reaction for Early Congestion Indication (ECI) within
   pkts_acked, as mentioned within pseudo code.
 o OWD is handled in relative format, where local time stamp will in
   tcp_time_stamp format.

Port from 2.4.19 to 2.6.16 as module by:
 Wong Hoi Sing Edison <hswong3i@gmail.com>
 Hung Hing Lun <hlhung3i@gmail.com>

Signed-off-by: Wong Hoi Sing Edison <hswong3i@gmail.com>
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:29:21 -07:00
Alexey Dobriyan
c45fb1089e [NETFILTER]: PPTP helper: fixup gre_keymap_lookup() return type
GRE keys are 16-bit wide.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:29:17 -07:00
Patrick McHardy
ae5b7d8ba2 [NETFILTER]: Add SIP connection tracking helper
Add SIP connection tracking helper. Originally written by
Christian Hentschel <chentschel@arnet.com.ar>, some cleanup, minor
fixes and bidirectional SIP support added by myself.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:29:15 -07:00
Patrick McHardy
e44ab66a75 [NETFILTER]: H.323 helper: replace internal_net_addr parameter by routing-based heuristic
Call Forwarding doesn't need to create an expectation if both peers can
reach each other without our help. The internal_net_addr parameter
lets the user explicitly specify a single network where this is true,
but is not very flexible and even fails in the common case that calls
will both be forwarded to outside parties and inside parties. Use an
optional heuristic based on routing instead, the assumption is that
if bpth the outgoing device and the gateway are equal, both peers can
reach each other directly.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:29:13 -07:00
Jing Min Zhao
c0d4cfd96d [NETFILTER]: H.323 helper: Add support for Call Forwarding
Signed-off-by: Jing Min Zhao <zhaojingmin@users.sourceforge.net>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:29:11 -07:00
Patrick McHardy
c952616934 [NETFILTER]: amanda helper: convert to textsearch infrastructure
When a port number within a packet is replaced by a differently sized
number only the packet is resized, but not the copy of the data.
Following port numbers are rewritten based on their offsets within
the copy, leading to packet corruption.

Convert the amanda helper to the textsearch infrastructure to avoid
the copy entirely.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:29:09 -07:00
Patrick McHardy
7d8c501817 [NETFILTER]: FTP helper: search optimization
Instead of skipping search entries for the wrong direction simply index
them by direction.

Based on patch by Pablo Neira <pablo@netfilter.org>

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:29:07 -07:00
Patrick McHardy
695ecea329 [NETFILTER]: SNMP helper: fix debug module param type
debug is the debug level, not a bool.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:29:05 -07:00
Patrick McHardy
89f2e21883 [NETFILTER]: ctnetlink: change table dumping not to require an unique ID
Instead of using the ID to find out where to continue dumping, take a
reference to the last entry dumped and try to continue there.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:29:03 -07:00
Patrick McHardy
3726add766 [NETFILTER]: ctnetlink: fix NAT configuration
The current configuration only allows to configure one manip and overloads
conntrack status flags with netlink semantic.

Signed-off-by: Patrick Mchardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:29:01 -07:00
Eric Leblond
997ae831ad [NETFILTER]: conntrack: add fixed timeout flag in connection tracking
Add a flag in a connection status to have a non updated timeout.
This permits to have connection that automatically die at a given
time.

Signed-off-by: Eric Leblond <eric@inl.fr>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:28:59 -07:00
Patrick McHardy
39a27a35c5 [NETFILTER]: conntrack: add sysctl to disable checksumming
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:28:57 -07:00
Patrick McHardy
6442f1cf89 [NETFILTER]: conntrack: don't call helpers for related ICMP messages
None of the existing helpers expects to get called for related ICMP
packets and some even drop them if they can't parse them.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:28:55 -07:00
Patrick McHardy
404bdbfd24 [NETFILTER]: recent match: replace by rewritten version
Replace the unmaintainable ipt_recent match by a rewritten version that
should be fully compatible.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:28:53 -07:00
Patrick McHardy
957dc80ac3 [NETFILTER]: x_tables: add SCTP/DCCP support where missing
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:28:47 -07:00
Patrick McHardy
3e72b2fe5b [NETFILTER]: x_tables: remove some unnecessary casts
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:28:45 -07:00
Herbert Xu
31a4ab9302 [IPSEC] proto: Move transport mode input path into xfrm_mode_transport
Now that we have xfrm_mode objects we can move the transport mode specific
input decapsulation code into xfrm_mode_transport.  This removes duplicate
code as well as unnecessary header movement in case of tunnel mode SAs
since we will discard the original IP header immediately.

This also fixes a minor bug for transport-mode ESP where the IP payload
length is set to the correct value minus the header length (with extension
headers for IPv6).

Of course the other neat thing is that we no longer have to allocate
temporary buffers to hold the IP headers for ESP and IPComp.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:28:41 -07:00
Herbert Xu
b59f45d0b2 [IPSEC] xfrm: Abstract out encapsulation modes
This patch adds the structure xfrm_mode.  It is meant to represent
the operations carried out by transport/tunnel modes.

By doing this we allow additional encapsulation modes to be added
without clogging up the xfrm_input/xfrm_output paths.

Candidate modes include 4-to-6 tunnel mode, 6-to-4 tunnel mode, and
BEET modes.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:28:39 -07:00
Herbert Xu
546be2405b [IPSEC] xfrm: Undo afinfo lock proliferation
The number of locks used to manage afinfo structures can easily be reduced
down to one each for policy and state respectively.  This is based on the
observation that the write locks are only held by module insertion/removal
which are very rare events so there is no need to further differentiate
between the insertion of modules like ipv6 versus esp6.

The removal of the read locks in xfrm4_policy.c/xfrm6_policy.c might look
suspicious at first.  However, after you realise that nobody ever takes
the corresponding write lock you'll feel better :)

As far as I can gather it's an attempt to guard against the removal of
the corresponding modules.  Since neither module can be unloaded at all
we can leave it to whoever fixes up IPv6 unloading :)

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:28:37 -07:00
David S. Miller
15986e1aad [TCP]: tcp_rcv_rtt_measure_ts() call in pure-ACK path is superfluous
We only want to take receive RTT mesaurements for data
bearing frames, here in the header prediction fast path
for a pure-sender, we know that we have a pure-ACK and
thus the checks in tcp_rcv_rtt_mesaure_ts() will not pass.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:26:16 -07:00
Chris Leech
1a2449a87b [I/OAT]: TCP recv offload to I/OAT
Locks down user pages and sets up for DMA in tcp_recvmsg, then calls
dma_async_try_early_copy in tcp_v4_do_rcv

Signed-off-by: Chris Leech <christopher.leech@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:25:56 -07:00
Chris Leech
9593782585 [I/OAT]: Add a sysctl for tuning the I/OAT offloaded I/O threshold
Any socket recv of less than this ammount will not be offloaded

Signed-off-by: Chris Leech <christopher.leech@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:25:54 -07:00
Chris Leech
624d116473 [I/OAT]: Make sk_eat_skb I/OAT aware.
Add an extra argument to sk_eat_skb, and make it move early copied
packets to the async_wait_queue instead of freeing them.

Signed-off-by: Chris Leech <christopher.leech@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:25:52 -07:00
Chris Leech
0e4b4992b8 [I/OAT]: Rename cleanup_rbuf to tcp_cleanup_rbuf and make non-static
Needed to be able to call tcp_cleanup_rbuf in tcp_input.c for I/OAT

Signed-off-by: Chris Leech <christopher.leech@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-17 21:25:50 -07:00
Weidong
42d1d52e69 [IPV4]: Increment ipInHdrErrors when TTL expires.
Signed-off-by: Weidong <weid@nanjing-fnst.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-12 13:09:59 -07:00
Aki M Nyrhinen
79320d7e14 [TCP]: continued: reno sacked_out count fix
From: Aki M Nyrhinen <anyrhine@cs.helsinki.fi>

IMHO the current fix to the problem (in_flight underflow in reno)
is incorrect.  it treats the symptons but ignores the problem. the
problem is timing out packets other than the head packet when we
don't have sack. i try to explain (sorry if explaining the obvious).

with sack, scanning the retransmit queue for timed out packets is
fine because we know which packets in our retransmit queue have been
acked by the receiver.

without sack, we know only how many packets in our retransmit queue the
receiver has acknowledged, but no idea which packets.

think of a "typical" slow-start overshoot case, where for example
every third packet in a window get lost because a router buffer gets
full.

with sack, we check for timeouts on those every third packet (as the
rest have been sacked). the packet counting works out and if there
is no reordering, we'll retransmit exactly the packets that were 
lost.

without sack, however, we check for timeout on every packet and end up
retransmitting consecutive packets in the retransmit queue. in our
slow-start example, 2/3 of those retransmissions are unnecessary. these
unnecessary retransmissions eat the congestion window and evetually
prevent fast recovery from continuing, if enough packets were lost.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-11 21:18:56 -07:00
Herbert Xu ~{PmVHI~}
f291196979 [TCP]: Avoid skb_pull if possible when trimming head
Trimming the head of an skb by calling skb_pull can cause the packet
to become unaligned if the length pulled is odd.  Since the length is
entirely arbitrary for a FIN packet carrying data, this is actually
quite common.

Unaligned data is not the end of the world, but we should avoid it if
it's easily done.  In this case it is trivial.  Since we're discarding
all of the head data it doesn't matter whether we move skb->data forward
or back.

However, it is still possible to have unaligned skb->data in general.
So network drivers should be prepared to handle it instead of crashing.

This patch also adds an unlikely marking on len < headlen since partial
ACKs on head data are extremely rare in the wild.  As the return value
of __pskb_trim_head is no longer ever NULL that has been removed.

Signed-off-by: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-05 15:03:37 -07:00
Stephen Hemminger
fb80a6e1a5 [TCP] tcp_highspeed: Fix problem observed by Xiaoliang (David) Wei
When snd_cwnd is smaller than 38 and the connection is in
congestion avoidance phase (snd_cwnd > snd_ssthresh), the snd_cwnd
seems to stop growing.

The additive increase was confused because C array's are 0 based.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-02 17:51:08 -07:00
Alexey Dobriyan
7114b0bb6d [NETFILTER]: PPTP helper: fix sstate/cstate typo
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-05-28 22:51:05 -07:00
Patrick McHardy
ca3ba88d0c [NETFILTER]: mark H.323 helper experimental
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-05-28 22:50:40 -07:00
Marcel Holtmann
6c813c3fe9 [NETFILTER]: Fix small information leak in SO_ORIGINAL_DST (CVE-2006-1343)
It appears that sockaddr_in.sin_zero is not zeroed during
getsockopt(...SO_ORIGINAL_DST...) operation. This can lead
to an information leak (CVE-2006-1343).

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-05-28 22:50:18 -07:00
Chris Wright
4a06373913 [NETFILTER]: SNMP NAT: fix memleak in snmp_object_decode
If kmalloc fails, error path leaks data allocated from asn1_oid_decode().

Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-05-23 15:15:13 -07:00
Patrick McHardy
4d942d8b39 [NETFILTER]: H.323 helper: fix sequence extension parsing
When parsing unknown sequence extensions the "son"-pointer points behind
the last known extension for this type, don't try to interpret it.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-05-23 15:15:10 -07:00
Patrick McHardy
7185989db4 [NETFILTER]: H.323 helper: fix parser error propagation
The condition "> H323_ERROR_STOP" can never be true since H323_ERROR_STOP
is positive and is the highest possible return code, while real errors are
negative, fix the checks. Also only abort on real errors in some spots
that were just interpreting any return value != 0 as error.

Fixes crashes caused by use of stale data after a parsing error occured:

BUG: unable to handle kernel paging request at virtual address bfffffff
 printing eip:
c01aa0f8
*pde = 1a801067
*pte = 00000000
Oops: 0000 [#1]
PREEMPT
Modules linked in: ip_nat_h323 ip_conntrack_h323 nfsd exportfs sch_sfq sch_red cls_fw sch_hfsc  xt_length ipt_owner xt_MARK iptable_mangle nfs lockd sunrpc pppoe pppoxx
CPU:    0
EIP:    0060:[<c01aa0f8>]    Not tainted VLI
EFLAGS: 00210646   (2.6.17-rc4 #8)
EIP is at memmove+0x19/0x22
eax: d77264e9   ebx: d77264e9   ecx: e88d9b17   edx: d77264e9
esi: bfffffff   edi: bfffffff   ebp: de6a7680   esp: c0349db8
ds: 007b   es: 007b   ss: 0068
Process asterisk (pid: 3765, threadinfo=c0349000 task=da068540)
Stack: <0>00000006 c0349e5e d77264e3 e09a2b4e e09a38a0 d7726052 d7726124 00000491
       00000006 00000006 00000006 00000491 de6a7680 d772601e d7726032 c0349f74
       e09a2dc2 00000006 c0349e5e 00000006 00000000 d76dda28 00000491 c0349f74
Call Trace:
 [<e09a2b4e>] mangle_contents+0x62/0xfe [ip_nat]
 [<e09a2dc2>] ip_nat_mangle_tcp_packet+0xa1/0x191 [ip_nat]
 [<e0a2712d>] set_addr+0x74/0x14c [ip_nat_h323]
 [<e0ad531e>] process_setup+0x11b/0x29e [ip_conntrack_h323]
 [<e0ad534f>] process_setup+0x14c/0x29e [ip_conntrack_h323]
 [<e0ad57bd>] process_q931+0x3c/0x142 [ip_conntrack_h323]
 [<e0ad5dff>] q931_help+0xe0/0x144 [ip_conntrack_h323]
...

Found by the PROTOS c07-h2250v4 testsuite.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-05-23 15:15:08 -07:00
Patrick McHardy
f41d5bb1d9 [NETFILTER]: SNMP NAT: fix memory corruption
Fix memory corruption caused by snmp_trap_decode:

- When snmp_trap_decode fails before the id and address are allocated,
  the pointers contain random memory, but are freed by the caller
  (snmp_parse_mangle).

- When snmp_trap_decode fails after allocating just the ID, it tries
  to free both address and ID, but the address pointer still contains
  random memory. The caller frees both ID and random memory again.

- When snmp_trap_decode fails after allocating both, it frees both,
  and the callers frees both again.

The corruption can be triggered remotely when the ip_nat_snmp_basic
module is loaded and traffic on port 161 or 162 is NATed.

Found by multiple testcases of the trap-app and trap-enc groups of the
PROTOS c06-snmpv1 testsuite.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-05-22 16:55:14 -07:00
Alexey Dobriyan
4195f81453 [NET]: Fix "ntohl(ntohs" bugs
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-05-22 16:53:22 -07:00
Solar Designer
2c8ac66bb2 [NETFILTER]: Fix do_add_counters race, possible oops or info leak (CVE-2006-0039)
Solar Designer found a race condition in do_add_counters(). The beginning
of paddc is supposed to be the same as tmp which was sanity-checked
above, but it might not be the same in reality. In case the integer
overflow and/or the race condition are triggered, paddc->num_counters
might not match the allocation size for paddc. If the check below
(t->private->number != paddc->num_counters) nevertheless passes (perhaps
this requires the race condition to be triggered), IPT_ENTRY_ITERATE()
would read kernel memory beyond the allocation size, potentially causing
an oops or leaking sensitive data (e.g., passwords from host system or
from another VPS) via counter increments. This requires CAP_NET_ADMIN.

Signed-off-by: Solar Designer <solar@openwall.com>
Signed-off-by: Kirill Korotaev <dev@openvz.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-05-19 02:16:52 -07:00
Alexey Dobriyan
a467704dcb [NETFILTER]: GRE conntrack: fix htons/htonl confusion
GRE keys are 16 bit.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-05-19 02:16:29 -07:00
Philip Craig
5c170a09d9 [NETFILTER]: fix format specifier for netfilter log targets
The prefix argument for nf_log_packet is a format specifier,
so don't pass the user defined string directly to it.

Signed-off-by: Philip Craig <philipc@snapgear.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-05-19 02:15:47 -07:00
Jesper Juhl
493e2428aa [NETFILTER]: Fix memory leak in ipt_recent
The Coverity checker spotted that we may leak 'hold' in
net/ipv4/netfilter/ipt_recent.c::checkentry() when the following
is true:
  if (!curr_table->status_proc) {
    ...
    if(!curr_table) {
    ...
      return 0;  <-- here we leak.
Simply moving an existing vfree(hold); up a bit avoids the possible leak.

Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-05-19 02:15:13 -07:00
Angelo P. Castellani
8872d8e1c4 [TCP]: reno sacked_out count fix
From: "Angelo P. Castellani" <angelo.castellani+lkml@gmail.com>

Using NewReno, if a sk_buff is timed out and is accounted as lost_out,
it should also be removed from the sacked_out.

This is necessary because recovery using NewReno fast retransmit could
take up to a lot RTTs and the sk_buff RTO can expire without actually
being really lost.

left_out = sacked_out + lost_out
in_flight = packets_out - left_out + retrans_out

Using NewReno without this patch, on very large network losses,
left_out becames bigger than packets_out + retrans_out (!!).

For this reason unsigned integer in_flight overflows to 2^32 - something.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-05-16 21:42:11 -07:00
Wei Yongjun
63cbd2fda3 [IPV4]: ip_options_fragment() has no effect on fragmentation
Fix error point to options in ip_options_fragment(). optptr get a
error pointer to the ipv4 header, correct is pointer to ipv4 options.

Signed-off-by: Wei Yongjun <weiyj@soft.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-05-09 15:18:50 -07:00
Hua Zhong
0182bd2b1e [IPV4]: Remove likely in ip_rcv_finish()
This is another result from my likely profiling tool
(dwalker@mvista.com just sent the patch of the profiling tool to
linux-kernel mailing list, which is similar to what I use).

On my system (not very busy, normal development machine within a
VMWare workstation), I see a 6/5 miss/hit ratio for this "likely".

Signed-off-by: Hua Zhong <hzhong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-05-06 18:11:39 -07:00
John Heffner
5528e568a7 [TCP]: Fix snd_cwnd adjustments in tcp_highspeed.c
Xiaoliang (David) Wei wrote:
> Hi gurus,
> 
>    I am reading the code of tcp_highspeed.c in the kernel and have a
> question on the hstcp_cong_avoid function, specifically the following
> AI part (line 136~143 in net/ipv4/tcp_highspeed.c ):
> 
>                /* Do additive increase */
>                if (tp->snd_cwnd < tp->snd_cwnd_clamp) {
>                        tp->snd_cwnd_cnt += ca->ai;
>                        if (tp->snd_cwnd_cnt >= tp->snd_cwnd) {
>                                tp->snd_cwnd++;
>                                tp->snd_cwnd_cnt -= tp->snd_cwnd;
>                        }
>                }
> 
>    In this part, when (tp->snd_cwnd_cnt == tp->snd_cwnd),
> snd_cwnd_cnt will be -1... snd_cwnd_cnt is defined as u16, will this
> small chance of getting -1 becomes a problem?
> Shall we change it by reversing the order of the cwnd++ and cwnd_cnt -= 
> cwnd?

Absolutely correct.  Thanks.

Signed-off-by: John Heffner <jheffner@psc.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-05-05 17:41:44 -07:00
Herbert Xu
75c2d9077c [TCP]: Fix sock_orphan dead lock
Calling sock_orphan inside bh_lock_sock in tcp_close can lead to dead
locks.  For example, the inet_diag code holds sk_callback_lock without
disabling BH.  If an inbound packet arrives during that admittedly tiny
window, it will cause a dead lock on bh_lock_sock.  Another possible
path would be through sock_wfree if the network device driver frees the
tx skb in process context with BH enabled.

We can fix this by moving sock_orphan out of bh_lock_sock.

The tricky bit is to work out when we need to destroy the socket
ourselves and when it has already been destroyed by someone else.

By moving sock_orphan before the release_sock we can solve this
problem.  This is because as long as we own the socket lock its
state cannot change.

So we simply record the socket state before the release_sock
and then check the state again after we regain the socket lock.
If the socket state has transitioned to TCP_CLOSE in the time being,
we know that the socket has been destroyed.  Otherwise the socket is
still ours to keep.

Note that I've also moved the increment on the orphan count forward.
This may look like a problem as we're increasing it even if the socket
is just about to be destroyed where it'll be decreased again.  However,
this simply enlarges a window that already exists.  This also changes
the orphan count test by one.

Considering what the orphan count is meant to do this is no big deal.

This problem was discoverd by Ingo Molnar using his lock validator.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-05-03 23:31:35 -07:00
Patrick McHardy
7800007c1e [NETFILTER]: x_tables: don't use __copy_{from,to}_user on unchecked memory in compat layer
Noticed by Linus Torvalds <torvalds@osdl.org>

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-05-03 23:20:27 -07:00
Jing Min Zhao
7582e9d17e [NETFILTER]: H.323 helper: Change author's email address
Signed-off-by: Jing Min Zhao <zhaojingmin@users.sourceforge.net>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-05-03 23:19:59 -07:00
Patrick McHardy
2354feaeb2 [NETFILTER]: NAT: silence unused variable warnings with CONFIG_XFRM=n
net/ipv4/netfilter/ip_nat_standalone.c: In function 'ip_nat_out':
net/ipv4/netfilter/ip_nat_standalone.c:223: warning: unused variable 'ctinfo'
net/ipv4/netfilter/ip_nat_standalone.c:222: warning: unused variable 'ct'

Surprisingly no complaints so far ..

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-05-03 23:19:26 -07:00