Commit graph

15115 commits

Author SHA1 Message Date
Brian Haley
793b147316 IPv6: data structure changes for new socket options
Add underlying data structure changes and basic setsockopt()
and getsockopt() support for IPV6_RECVPATHMTU, IPV6_PATHMTU,
and IPV6_DONTFRAG.  IPV6_PATHMTU is actually fully functional
at this point.

Signed-off-by: Brian Haley <brian.haley@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-23 23:35:28 -07:00
Jiri Pirko
3a73702863 l2tp_eth: fix memory allocation
Since .size is set properly in "struct pernet_operations l2tp_eth_net_ops",
allocating space for "struct l2tp_eth_net" by hand is not correct, even causes
memory leakage.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-23 16:37:33 -07:00
Jiri Pirko
e773aaff82 l2tp: fix memory allocation
Since .size is set properly in "struct pernet_operations l2tp_net_ops",
allocating space for "struct l2tp_net" by hand is not correct, even causes
memory leakage.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-23 16:37:32 -07:00
John W. Linville
3b51cc996e Merge branch 'master' into for-davem
Conflicts:
	drivers/net/wireless/ath/ath9k/phy.c
	drivers/net/wireless/iwlwifi/iwl-6000.c
	drivers/net/wireless/iwlwifi/iwl-debugfs.c
2010-04-23 14:43:45 -04:00
Andrew Hendry
5ebfbc06aa X25: Add if_x25.h and x25 to device identifiers
V2 Feedback from John Hughes.
- Add header for userspace implementations such as xot/xoe to use
- Use explicit values for interface stability
- No changes to driver patches

V1
- Use identifiers instead of magic numbers for X25 layer 3 to device interface.
- Also fixed checkpatch notes on updated code.

[ Add new user header to include/linux/Kbuild  -DaveM ]

Signed-off-by: Andrew Hendry <andrew.hendry@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-22 16:12:36 -07:00
Paul LeoNerd Evans
40eaf96271 net: Socket filter ancilliary data access for skb->dev->type
Add an SKF_AD_HATYPE field to the packet ancilliary data area, giving
access to skb->dev->type, as reported in the sll_hatype field.

When capturing packets on a PF_PACKET/SOCK_RAW socket bound to all
interfaces, there doesn't appear to be a way for the filter program to
actually find out the underlying hardware type the packet was captured
on. This patch adds such ability.

This patch also handles the case where skb->dev can be NULL, such as on
netlink sockets.

Signed-off-by: Paul Evans <leonerd@leonerd.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-22 16:05:44 -07:00
Tom Herbert
aa2ea0586d tcp: fix outsegs stat for TSO segments
Account for TSO segments of an skb in TCP_MIB_OUTSEGS counter.  Without
doing this, the counter can be off by orders of magnitude from the
actual number of segments sent.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-22 16:00:00 -07:00
Stephen Hemminger
e802af9cab IPv6: Generic TTL Security Mechanism (final version)
This patch adds IPv6 support for RFC5082 Generalized TTL Security Mechanism.  

Not to users of mapped address; the IPV6 and IPV4 socket options are seperate.
The server does have to deal with both IPv4 and IPv6 socket options
and the client has to handle the different for each family.

On client:
	int ttl = 255;
	getaddrinfo(argv[1], argv[2], &hint, &result);

	for (rp = result; rp != NULL; rp = rp->ai_next) {
		s = socket(rp->ai_family, rp->ai_socktype, rp->ai_protocol);
		if (s < 0) continue;

		if (rp->ai_family == AF_INET) {
			setsockopt(s, IPPROTO_IP, IP_TTL, &ttl, sizeof(ttl));
		} else if (rp->ai_family == AF_INET6) {
			setsockopt(s, IPPROTO_IPV6,  IPV6_UNICAST_HOPS, 
					&ttl, sizeof(ttl)))
		}
			
		if (connect(s, rp->ai_addr, rp->ai_addrlen) == 0) {
		   ...

On server:
	int minttl = 255 - maxhops;
   
	getaddrinfo(NULL, port, &hints, &result);
	for (rp = result; rp != NULL; rp = rp->ai_next) {
		s = socket(rp->ai_family, rp->ai_socktype, rp->ai_protocol);
		if (s < 0) continue;

		if (rp->ai_family == AF_INET6)
			setsockopt(s, IPPROTO_IPV6,  IPV6_MINHOPCOUNT,
					&minttl, sizeof(minttl));
		setsockopt(s, IPPROTO_IP, IP_MINTTL, &minttl, sizeof(minttl));
			
		if (bind(s, rp->ai_addr, rp->ai_addrlen) == 0)
			break
...

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-22 15:24:53 -07:00
David S. Miller
9ccb897594 net: Orphan and de-dst skbs earlier in xmit path.
This way GSO packets don't get handled differently.

With help from Eric Dumazet.

Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
2010-04-22 01:02:07 -07:00
Eric Dumazet
e326bed2f4 rps: immediate send IPI in process_backlog()
If some skb are queued to our backlog, we are delaying IPI sending at
the end of net_rx_action(), increasing latencies. This defeats the
queueing, since we want to quickly dispatch packets to the pool of
worker cpus, then eventually deeply process our packets.

It's better to send IPI before processing our packets in upper layers,
from process_backlog().

Change the _and_disable_irq suffix to _and_enable_irq(), since we enable
local irq in net_rps_action(), sorry for the confusion.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-22 00:22:45 -07:00
Johannes Berg
b002a86109 ethernet: print protocol in host byte order
Eric's recent patch added __force, but this
place would seem to require actually doing
a byte order conversion so the printk is
consistent across architectures.

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-21 22:57:19 -07:00
Eric Dumazet
9a20e3197e net: Introduce skb_orphan_try()
At this point, skb->destructor is not the original one (stored in
DEV_GSO_CB(skb)->destructor)

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-21 22:54:08 -07:00
Eric Dumazet
989a297920 fasync: RCU and fine grained locking
kill_fasync() uses a central rwlock, candidate for RCU conversion, to
avoid cache line ping pongs on SMP.

fasync_remove_entry() and fasync_add_entry() can disable IRQS on a short
section instead during whole list scan.

Use a spinlock per fasync_struct to synchronize kill_fasync_rcu() and
fasync_{remove|add}_entry(). This spinlock is IRQ safe, so sock_fasync()
doesnt need its own implementation and can use fasync_helper(), to
reduce code size and complexity.

We can remove __kill_fasync() direct use in net/socket.c, and rename it
to kill_fasync_rcu().

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-21 16:19:29 -07:00
David S. Miller
e5700aff14 tcp: Mark v6 response packets as CHECKSUM_PARTIAL
Otherwise we only get the checksum right for data-less TCP responses.

Noticed by Herbert Xu.

Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-21 14:59:20 -07:00
David S. Miller
f71b70e115 tcp: Fix ipv6 checksumming on response packets for real.
Commit 6651ffc8e8
("ipv6: Fix tcp_v6_send_response transport header setting.")
fixed one half of why ipv6 tcp response checksums were
invalid, but it's not the whole story.

If we're going to use CHECKSUM_PARTIAL for these things (which we are
since commit 2e8e18ef52 "tcp: Set
CHECKSUM_UNNECESSARY in tcp_init_nondata_skb"), we can't be setting
buff->csum as we always have been here in tcp_v6_send_response.  We
need to leave it at zero.

Kill that line and checksums are good again.

Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-21 01:57:01 -07:00
David S. Miller
87eb367003 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/wireless/iwlwifi/iwl-6000.c
	net/core/dev.c
2010-04-21 01:14:25 -07:00
David Howells
05d17608a6 net: Fix an RCU warning in dev_pick_tx()
Fix the following RCU warning in dev_pick_tx():

===================================================
[ INFO: suspicious rcu_dereference_check() usage. ]
---------------------------------------------------
net/core/dev.c:1993 invoked rcu_dereference_check() without protection!

other info that might help us debug this:

rcu_scheduler_active = 1, debug_locks = 0
2 locks held by swapper/0:
 #0:  (&idev->mc_ifc_timer){+.-...}, at: [<ffffffff81039e65>] run_timer_softirq+0x17b/0x278
 #1:  (rcu_read_lock_bh){.+....}, at: [<ffffffff812ea3eb>] dev_queue_xmit+0x14e/0x4dc

stack backtrace:
Pid: 0, comm: swapper Not tainted 2.6.34-rc5-cachefs #4
Call Trace:
 <IRQ>  [<ffffffff810516c4>] lockdep_rcu_dereference+0xaa/0xb2
 [<ffffffff812ea4f6>] dev_queue_xmit+0x259/0x4dc
 [<ffffffff812ea3eb>] ? dev_queue_xmit+0x14e/0x4dc
 [<ffffffff81052324>] ? trace_hardirqs_on+0xd/0xf
 [<ffffffff81035362>] ? local_bh_enable_ip+0xbc/0xc1
 [<ffffffff812f0954>] neigh_resolve_output+0x24b/0x27c
 [<ffffffff8134f673>] ip6_output_finish+0x7c/0xb4
 [<ffffffff81350c34>] ip6_output2+0x256/0x261
 [<ffffffff81052324>] ? trace_hardirqs_on+0xd/0xf
 [<ffffffff813517fb>] ip6_output+0xbbc/0xbcb
 [<ffffffff8135bc5d>] ? fib6_force_start_gc+0x2b/0x2d
 [<ffffffff81368acb>] mld_sendpack+0x273/0x39d
 [<ffffffff81368858>] ? mld_sendpack+0x0/0x39d
 [<ffffffff81052099>] ? mark_held_locks+0x52/0x70
 [<ffffffff813692fc>] mld_ifc_timer_expire+0x24f/0x288
 [<ffffffff81039ed6>] run_timer_softirq+0x1ec/0x278
 [<ffffffff81039e65>] ? run_timer_softirq+0x17b/0x278
 [<ffffffff813690ad>] ? mld_ifc_timer_expire+0x0/0x288
 [<ffffffff81035531>] ? __do_softirq+0x69/0x140
 [<ffffffff8103556a>] __do_softirq+0xa2/0x140
 [<ffffffff81002e0c>] call_softirq+0x1c/0x28
 [<ffffffff81004b54>] do_softirq+0x38/0x80
 [<ffffffff81034f06>] irq_exit+0x45/0x47
 [<ffffffff810177c3>] smp_apic_timer_interrupt+0x88/0x96
 [<ffffffff810028d3>] apic_timer_interrupt+0x13/0x20
 <EOI>  [<ffffffff810488dd>] ? __atomic_notifier_call_chain+0x0/0x86
 [<ffffffff810096bf>] ? mwait_idle+0x6e/0x78
 [<ffffffff810096b6>] ? mwait_idle+0x65/0x78
 [<ffffffff810011cb>] cpu_idle+0x4d/0x83
 [<ffffffff81380b05>] rest_init+0xb9/0xc0
 [<ffffffff81380a4c>] ? rest_init+0x0/0xc0
 [<ffffffff8168dcf0>] start_kernel+0x392/0x39d
 [<ffffffff8168d2a3>] x86_64_start_reservations+0xb3/0xb7
 [<ffffffff8168d38b>] x86_64_start_kernel+0xe4/0xeb

An rcu_dereference() should be an rcu_dereference_bh().

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-21 01:09:44 -07:00
David S. Miller
e04997b13a Merge branch 'master' of /home/davem/src/GIT/linux-2.6/ 2010-04-21 00:50:39 -07:00
Herbert Xu
6651ffc8e8 ipv6: Fix tcp_v6_send_response transport header setting.
My recent patch to remove the open-coded checksum sequence in
tcp_v6_send_response broke it as we did not set the transport
header pointer on the new packet.

Actually, there is code there trying to set the transport
header properly, but it sets it for the wrong skb ('skb'
instead of 'buff').

This bug was introduced by commit
a8fdf2b331 ("ipv6: Fix
tcp_v6_send_response(): it didn't set skb transport header")

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-21 00:47:15 -07:00
Rami Rosen
ccb7c7732e net: Remove two unnecessary exports (skbuff).
There is no need to export skb_under_panic() and skb_over_panic() in
skbuff.c, since these methods are used only in skbuff.c ; this patch
removes these two exports. It also marks these functions as 'static'
and removeS the extern declarations of them from
include/linux/skbuff.h

Signed-off-by: Rami Rosen <ramirose@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-20 22:39:53 -07:00
Eric Dumazet
0eae88f31c net: Fix various endianness glitches
Sparse can help us find endianness bugs, but we need to make some
cleanups to be able to more easily spot real bugs.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-20 19:06:52 -07:00
Eric Dumazet
8eabf95cb1 bridge: add a missing ntohs()
grec_nsrcs is in network order, we should convert to host horder in
br_multicast_igmp3_report()

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-20 18:51:57 -07:00
David S. Miller
e46754f8c9 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-2.6 2010-04-20 17:57:56 -07:00
Eric Dumazet
aa39514516 net: sk_sleep() helper
Define a new function to return the waitqueue of a "struct sock".

static inline wait_queue_head_t *sk_sleep(struct sock *sk)
{
	return sk->sk_sleep;
}

Change all read occurrences of sk_sleep by a call to this function.

Needed for a future RCU conversion. sk_sleep wont be a field directly
available.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-20 16:37:13 -07:00
Juuso Oikarinen
7bdfcaaff5 mac80211: Fix ieee80211_sta_conn_mon_timer with hw connection monitoring
When IEEE80211_HW_CONNECTION_MONITOR is configured by the driver, starting
of ieee80211_sta_conn_mon_timer should be prevented, as it is then not needed.

This is currently partially the case. As it seems, when a probe-response is
received from the AP the timer is still restarted, thus restarting the host
based connection keep-alive mechanism. These probe-responses happen at least
when scanning while associated.

Fix this by preventing starting of the ieee80211_sta_conn_mon_timer in the
ieee80211_rx_mgmt_probe_resp function.

Signed-off-by: Juuso Oikarinen <juuso.oikarinen@nokia.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2010-04-20 11:52:40 -04:00
Holger Schurig
1289723ef2 mac80211: sample survey implementation for mac80211 & hwsim
This adds the survey function to both mac80211 itself and to mac80211_hwsim.
For the latter driver, we simply invent some noise level.A real driver which
cannot determine the real channel noise MUST NOT report any noise, especially
not a magically conjured one :-)

Signed-off-by: Holger Schurig <holgerschurig@gmail.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2010-04-20 11:50:52 -04:00
Daniel Yingqiang Ma
03ceedea97 ath9k: Group Key fix for VAPs
When I set up multiple VAPs with ath9k, I encountered an issue that
the traffic may be lost after a while.

The detailed phenomenon is
1. After a while the clients connected to one of these VAPs will get
into a state that no broadcast/multicast packets can be transfered
successfully while the unicast packets can be transfered normally.
2. Minutes latter the unitcast packets transfer will fail as well,
because the ARP entry is expired and it can't be freshed due to the
broadcast trouble.

It's caused by the group key overwritten and someone discussed this
issue in ath9k-devel maillist before, but haven't work out a fix yet.

I referred the method in madwifi, and made a patch for ath9k.
The method is to set the high bit of the sender(AP)'s address, and
associated that mac and the group key. It requires the hardware
supports multicast frame key search. It seems true for AR9160.

Not sure whether it's the correct way to fix this issue. But it seems
to work in my test. The patch is attached, feel free to revise it.

Signed-off-by: Daniel Yingqiang ma <yma.cool@gmail.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2010-04-20 11:50:51 -04:00
Jiri Pirko
ab9304717f net: emphasize rtnl lock required in call_netdevice_notifiers
Since netdev_chain is guarded by rtnl_lock, ASSERT_RTNL should be
present here to make sure that all callers of call_netdevice_notifiers
does the locking properly.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-20 01:45:37 -07:00
Eric Dumazet
b249dcb82d rps: consistent rxhash
In case we compute a software skb->rxhash, we can generate a consistent
hash : Its value will be the same in both flow directions.

This helps some workloads, like conntracking, since the same state needs
to be accessed in both directions.

tbench + RFS + this patch gives better results than tbench with default
kernel configuration (no RPS, no RFS)

Also fixed some sparse warnings.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-20 01:18:06 -07:00
Eric Dumazet
e36fa2f7e9 rps: cleanups
struct softnet_data holds many queues, so consistent use "sd" name
instead of "queue" is better.

Adds a rps_ipi_queued() helper to cleanup enqueue_to_backlog()

Adds a _and_irq_disable suffix to net_rps_action() name, as David
suggested.

incr_input_queue_head() becomes input_queue_head_incr()

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-20 01:18:05 -07:00
Eric Dumazet
f5acb907dc rps: static functions
store_rps_map() & store_rps_dev_flow_table_cnt() are static.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-19 14:40:57 -07:00
Johannes Berg
67e0f39277 mac80211: add missing newline
One HT debugging printk is missing a newline,
add it.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2010-04-19 16:45:20 -04:00
Juuso Oikarinen
3393a608c4 mac80211: Prevent running sta_cleanup timer unnecessarily
The sta_cleanup timer is used to periodically expire buffered frames from the
tx buf. The timer is executing periodically, regardless of the need for it.
This is wasting resources.

Fix this simply by not restarting the sta_cleanup timer if the tx buffer was
empty. Restart the timer when there is some more tx-traffic.

Cc: Janne Ylälehto <janne.ylalehto@nokia.com>
Signed-off-by: Juuso Oikarinen <juuso.oikarinen@nokia.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2010-04-19 16:41:42 -04:00
Johannes Berg
2aab4c273a mac80211: fix stopping RX BA session from timer
Kalle reported that his system deadlocks since my
recent work in this area. The reason quickly became
apparent: we try to cancel_timer_sync() a timer
from within itself. Fix that by making the function
aware of the context it is called from.

Reported-by: Kalle Valo <kvalo@adurom.com>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Tested-by: Kalle Valo <kvalo@adurom.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2010-04-19 16:41:42 -04:00
Reinette Chatre
fe6f212ce1 mac80211: pass HT changes to driver when off channel
Since "mac80211: make off-channel work generic" drivers have not been
notified of configuration changes after association or authentication. This
caused more dependence on current state to ensure driver will be notified
when configuration changes occur. One such problem arises if off-channel is
in progress when HT information changes. Since HT is only enabled on the
"oper_channel" the driver will never be notified of this change. Usually
the driver is notified soon after of a BSS information change
(BSS_CHANGED_HT) ... but since the driver did not get a notification that
this is a HT channel the new BSS information does not make sense.

Fix this by also changing the off-channel information when HT is enabled
and thus cause driver to be notified correctly.

This fixes a problem in 4965 when associated with 5GHz 40MHz channel.
Without this patch the system can associate but is unable to transfer any
data, not even ping.

See http://bugzilla.intellinuxwireless.org/show_bug.cgi?id=2158

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2010-04-19 16:34:11 -04:00
Johannes Berg
b4bb5c3fd9 mac80211: remove bogus TX agg state assignment
When the addba timer expires but has no work to do,
it should not affect the state machine. If it does,
TX will not see the successfully established and we
can also crash trying to re-establish the session.

Cc: stable@kernel.org [2.6.32, 2.6.33]
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2010-04-19 16:34:11 -04:00
Eric Dumazet
88751275b8 rps: shortcut net_rps_action()
net_rps_action() is a bit expensive on NR_CPUS=64..4096 kernels, even if
RPS is not active.

Tom Herbert used two bitmasks to hold information needed to send IPI,
but a single LIFO list seems more appropriate.

Move all RPS logic into net_rps_action() to cleanup net_rx_action() code
(remove two ifdefs)

Move rps_remote_softirq_cpus into softnet_data to share its first cache
line, filling an existing hole.

In a future patch, we could call net_rps_action() from process_backlog()
to make sure we send IPI before handling this cpu backlog.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-19 13:20:34 -07:00
Linus Torvalds
375db4810b Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
  gigaset: include cleanup cleanup
  packet : remove init_net restriction
  WAN: flush tx_queue in hdlc_ppp to prevent panic on rmmod hw_driver.
  ip: Fix ip_dev_loopback_xmit()
  net: dev_pick_tx() fix
  fib: suppress lockdep-RCU false positive in FIB trie.
  tun: orphan an skb on tx
  forcedeth: fix tx limit2 flag check
  iwlwifi: work around bogus active chains detection
2010-04-19 07:27:45 -07:00
Eric Dumazet
fc6055a5ba net: Introduce skb_orphan_try()
Transmitted skb might be attached to a socket and a destructor, for
memory accounting purposes.

Traditionally, this destructor is called at tx completion time, when skb
is freed.

When tx completion is performed by another cpu than the sender, this
forces some cache lines to change ownership. XPS was an attempt to give
tx completion to initial cpu.

David idea is to call destructor right before giving skb to device (call
to ndo_start_xmit()). Because device queues are usually small, orphaning
skb before tx completion is not a big deal. Some drivers already do
this, we could do it in upper level.

There is one known exception to this early orphaning, called tx
timestamping. It needs to keep a reference to socket until device can
give a hardware or software timestamp.

This patch adds a skb_orphan_try() helper, to centralize all exceptions
to early orphaning in one spot, and use it in dev_hard_start_xmit().

"tbench 16" results on a Nehalem machine (2 X5570  @ 2.93GHz)
before: Throughput 4428.9 MB/sec 16 procs
after: Throughput 4448.14 MB/sec 16 procs

UDP should get even better results, its destructor being more complex,
since SOCK_USE_WRITE_QUEUE is not set (four atomic ops instead of one)

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-18 02:39:41 -07:00
Eric Dumazet
9958da0501 net: remove time limit in process_backlog()
- There is no point to enforce a time limit in process_backlog(), since
other napi instances dont follow same rule. We can exit after only one
packet processed...
The normal quota of 64 packets per napi instance should be the norm, and
net_rx_action() already has its own time limit.
Note : /proc/net/core/dev_weight can be used to tune this 64 default
value.

- Use DEFINE_PER_CPU_ALIGNED for softnet_data definition.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-18 02:36:13 -07:00
Eric Dumazet
8770acf049 rps: rps_sock_flow_table is mostly read
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-17 00:54:36 -07:00
Tom Herbert
fec5e652e5 rfs: Receive Flow Steering
This patch implements receive flow steering (RFS).  RFS steers
received packets for layer 3 and 4 processing to the CPU where
the application for the corresponding flow is running.  RFS is an
extension of Receive Packet Steering (RPS).

The basic idea of RFS is that when an application calls recvmsg
(or sendmsg) the application's running CPU is stored in a hash
table that is indexed by the connection's rxhash which is stored in
the socket structure.  The rxhash is passed in skb's received on
the connection from netif_receive_skb.  For each received packet,
the associated rxhash is used to look up the CPU in the hash table,
if a valid CPU is set then the packet is steered to that CPU using
the RPS mechanisms.

The convolution of the simple approach is that it would potentially
allow OOO packets.  If threads are thrashing around CPUs or multiple
threads are trying to read from the same sockets, a quickly changing
CPU value in the hash table could cause rampant OOO packets--
we consider this a non-starter.

To avoid OOO packets, this solution implements two types of hash
tables: rps_sock_flow_table and rps_dev_flow_table.

rps_sock_table is a global hash table.  Each entry is just a CPU
number and it is populated in recvmsg and sendmsg as described above.
This table contains the "desired" CPUs for flows.

rps_dev_flow_table is specific to each device queue.  Each entry
contains a CPU and a tail queue counter.  The CPU is the "current"
CPU for a matching flow.  The tail queue counter holds the value
of a tail queue counter for the associated CPU's backlog queue at
the time of last enqueue for a flow matching the entry.

Each backlog queue has a queue head counter which is incremented
on dequeue, and so a queue tail counter is computed as queue head
count + queue length.  When a packet is enqueued on a backlog queue,
the current value of the queue tail counter is saved in the hash
entry of the rps_dev_flow_table.

And now the trick: when selecting the CPU for RPS (get_rps_cpu)
the rps_sock_flow table and the rps_dev_flow table for the RX queue
are consulted.  When the desired CPU for the flow (found in the
rps_sock_flow table) does not match the current CPU (found in the
rps_dev_flow table), the current CPU is changed to the desired CPU
if one of the following is true:

- The current CPU is unset (equal to RPS_NO_CPU)
- Current CPU is offline
- The current CPU's queue head counter >= queue tail counter in the
rps_dev_flow table.  This checks if the queue tail has advanced
beyond the last packet that was enqueued using this table entry.
This guarantees that all packets queued using this entry have been
dequeued, thus preserving in order delivery.

Making each queue have its own rps_dev_flow table has two advantages:
1) the tail queue counters will be written on each receive, so
keeping the table local to interrupting CPU s good for locality.  2)
this allows lockless access to the table-- the CPU number and queue
tail counter need to be accessed together under mutual exclusion
from netif_receive_skb, we assume that this is only called from
device napi_poll which is non-reentrant.

This patch implements RFS for TCP and connected UDP sockets.
It should be usable for other flow oriented protocols.

There are two configuration parameters for RFS.  The
"rps_flow_entries" kernel init parameter sets the number of
entries in the rps_sock_flow_table, the per rxqueue sysfs entry
"rps_flow_cnt" contains the number of entries in the rps_dev_flow
table for the rxqueue.  Both are rounded to power of two.

The obvious benefit of RFS (over just RPS) is that it achieves
CPU locality between the receive processing for a flow and the
applications processing; this can result in increased performance
(higher pps, lower latency).

The benefits of RFS are dependent on cache hierarchy, application
load, and other factors.  On simple benchmarks, we don't necessarily
see improvement and sometimes see degradation.  However, for more
complex benchmarks and for applications where cache pressure is
much higher this technique seems to perform very well.

Below are some benchmark results which show the potential benfit of
this patch.  The netperf test has 500 instances of netperf TCP_RR
test with 1 byte req. and resp.  The RPC test is an request/response
test similar in structure to netperf RR test ith 100 threads on
each host, but does more work in userspace that netperf.

e1000e on 8 core Intel
   No RFS or RPS		104K tps at 30% CPU
   No RFS (best RPS config):    290K tps at 63% CPU
   RFS				303K tps at 61% CPU

RPC test	tps	CPU%	50/90/99% usec latency	Latency StdDev
  No RFS/RPS	103K	48%	757/900/3185		4472.35
  RPS only:	174K	73%	415/993/2468		491.66
  RFS		223K	73%	379/651/1382		315.61

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-16 16:01:27 -07:00
Daniel Lezcano
1c4f019732 packet : remove init_net restriction
The af_packet protocol is used by Perl to do ioctls as reported by
Stephane Riviere:

"Net::RawIP relies on SIOCGIFADDR et SIOCGIFHWADDR to get the IP and MAC
addresses of the network interface."

But in a new network namespace these ioctl fail because it is disabled for
a namespace different from the init_net_ns.

These two lines should not be there as af_inet and af_packet are
namespace aware since a long time now. I suppose we forget to remove these
lines because we sent the af_packet first, before af_inet was supported.

Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr>
Reported-by: Stephane Riviere <stephane.riviere@regis-dgac.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-16 15:41:04 -07:00
Nishant Sarmukadam
7834704be4 cfg80211: Avoid sending IWEVASSOCREQIE and IWEVASSOCRESPIE events with NULL event body
In a scenario, where a cfg80211 driver (station mode) does not send assoc request
and assoc response IEs in cfg80211_connect_result after a successful association
to an AP, cfg80211 sends IWEVASSOCREQIE and IWEVASSOCRESPIE to the user space
application with NULL data. This can cause an issue at the event recipient.

An example of this is when cfg80211 sends IWEVASSOCREQIE and IWEVASSOCRESPIE
events with NULL event body to wpa_supplicant. The wpa_supplicant overwrites
the assoc request and assoc response IEs for this station with NULL data.
If the association is WPA/WPA2, the wpa_supplicant is not able to generate
EAPOL handshake messages, since the IEs are NULL.

With the patch, req_ie and resp_ie will be NULL by avoiding the
assignment if the driver has not sent the IEs to cfg80211. The event sending
code sends the events only if resp_ie and req_ie are not NULL. This
will ensure that the events are not sent with NULL event body.

Signed-off-by: Nishant Sarmukadam <nishants@marvell.com>
Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2010-04-16 15:32:00 -04:00
Shan Wei
b5d4399823 ipv6: fix the comment of ip6_xmit()
ip6_xmit() is used by upper transport protocol.

Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-15 23:36:38 -07:00
Shan Wei
4e15ed4d93 net: replace ipfragok with skb->local_df
As Herbert Xu said: we should be able to simply replace ipfragok
with skb->local_df. commit f88037(sctp: Drop ipfargok in sctp_xmit function)
has droped ipfragok and set local_df value properly.

The patch kills the ipfragok parameter of .queue_xmit().

Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-15 23:36:37 -07:00
Shan Wei
0eecb78494 ipv6: cancel to setting local_df in ip6_xmit()
commit f88037(sctp: Drop ipfargok in sctp_xmit function)
has droped ipfragok and set local_df value properly.

So the change of commit 77e2f1(ipv6: Fix ip6_xmit to
send fragments if ipfragok is true) is not needed.
So the patch remove them.

Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-15 23:36:37 -07:00
Joe Perches
a4fbf8415c net/l2tp/l2tp_debugfs.c: Convert NIPQUAD to %pI4
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-15 15:37:13 -07:00
David S. Miller
3eb14b944f Merge branch 'for-davem' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next-2.6 2010-04-15 14:31:06 -07:00
Eric Dumazet
e30b38c298 ip: Fix ip_dev_loopback_xmit()
Eric Paris got following trace with a linux-next kernel

[   14.203970] BUG: using smp_processor_id() in preemptible [00000000]
code: avahi-daemon/2093
[   14.204025] caller is netif_rx+0xfa/0x110
[   14.204035] Call Trace:
[   14.204064]  [<ffffffff81278fe5>] debug_smp_processor_id+0x105/0x110
[   14.204070]  [<ffffffff8142163a>] netif_rx+0xfa/0x110
[   14.204090]  [<ffffffff8145b631>] ip_dev_loopback_xmit+0x71/0xa0
[   14.204095]  [<ffffffff8145b892>] ip_mc_output+0x192/0x2c0
[   14.204099]  [<ffffffff8145d610>] ip_local_out+0x20/0x30
[   14.204105]  [<ffffffff8145d8ad>] ip_push_pending_frames+0x28d/0x3d0
[   14.204119]  [<ffffffff8147f1cc>] udp_push_pending_frames+0x14c/0x400
[   14.204125]  [<ffffffff814803fc>] udp_sendmsg+0x39c/0x790
[   14.204137]  [<ffffffff814891d5>] inet_sendmsg+0x45/0x80
[   14.204149]  [<ffffffff8140af91>] sock_sendmsg+0xf1/0x110
[   14.204189]  [<ffffffff8140dc6c>] sys_sendmsg+0x20c/0x380
[   14.204233]  [<ffffffff8100ad82>] system_call_fastpath+0x16/0x1b

While current linux-2.6 kernel doesnt emit this warning, bug is latent
and might cause unexpected failures.

ip_dev_loopback_xmit() runs in process context, preemption enabled, so
must call netif_rx_ni() instead of netif_rx(), to make sure that we
process pending software interrupt.

Same change for ip6_dev_loopback_xmit()

Reported-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-15 14:25:22 -07:00