Commit graph

2713 commits

Author SHA1 Message Date
Johannes Berg
11a28d373e net: make namespace iteration possible under RCU
All we need to take care of is using proper RCU list
add/del primitives and inserting a synchronize_rcu()
at one place to make sure the exit notifiers are run
after everybody has stopped iterating the list.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-07-12 14:03:25 -07:00
Johannes Berg
667503ddcb cfg80211: fix locking
Over time, a lot of locking issues have crept into
the smarts of cfg80211, so e.g. scan completion can
race against a new scan, IBSS join can race against
leaving an IBSS, etc.

Introduce a new per-interface lock that protects
most of the per-interface data that we need to keep
track of, and sprinkle assertions about that lock
everywhere. Some things now need to be offloaded to
work structs so that we don't require being able to
sleep in functions the drivers call. The exception
to that are the MLME callbacks (rx_auth etc.) that
currently only mac80211 calls because it was easier
to do that there instead of in cfg80211, and future
drivers implementing those calls will, if they ever
exist, probably need to use a similar scheme like
mac80211 anyway...

In order to be able to handle _deauth and _disassoc
properly, introduce a cookie passed to it that will
determine locking requirements.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2009-07-10 15:02:32 -04:00
Johannes Berg
cb0b4beb93 cfg80211: mlme API must be able to sleep
After the mac80211 mlme cleanup, we can require that
the MLME functions in cfg80211 can sleep. This will
simplify future work in cfg80211 a lot.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2009-07-10 15:02:31 -04:00
Johannes Berg
c238c8ac63 cfg80211: dont use union for wext
Otherwise it becomes very hard to reset the structs
correctly since wext can be configured while the
interface is down.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2009-07-10 15:02:31 -04:00
Johannes Berg
3e5d7649a6 cfg80211: let SME control reassociation vs. association
Since we don't really know that well in the kernel,
let's let the SME control whether it wants to use
reassociation or not, by allowing it to give the
previous BSSID in the associate() parameters.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2009-07-10 15:02:30 -04:00
Johannes Berg
19957bb399 cfg80211: keep track of BSSes
In order to avoid problems with BSS structs going away
while they're in use, I've long wanted to make cfg80211
keep track of them. Without the SME, that wasn't doable
but now that we have the SME we can do this too. It can
keep track of up to four separate authentications and
one association, regardless of whether it's controlled
by the cfg80211 SME or the userspace SME.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2009-07-10 15:01:53 -04:00
Johannes Berg
517357c685 cfg80211: assimilate and export ieee80211_bss_get_ie
This function from mac80211 seems generally useful, and
I will need it in cfg80211 soon.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2009-07-10 15:01:53 -04:00
Johannes Berg
8990646d2f cfg80211: implement get_wireless_stats
By dropping the noise reporting, we can implement
wireless stats in cfg80211. We also make the
handler return NULL if we have no information,
which is possible thanks to the recent wext change.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2009-07-10 15:01:52 -04:00
Johannes Berg
9930380f0b cfg80211: implement IWRATE
For now, let's implement that using a very hackish way:
simply mirror the wext API in the cfg80211 API. This
will have to be changed later when we implement proper
bitrate API.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2009-07-10 15:01:52 -04:00
Johannes Berg
ab737a4f7d cfg80211: implement IWAP for WDS
This implements siocsiwap/giwap for WDS mode.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2009-07-10 15:01:52 -04:00
Johannes Berg
bc92afd920 cfg80211: implement iwpower
Just on/off and timeout, and with a hacky cfg80211 method
until we figure out what we want, though this is probably
sufficient as we want to use pm_qos for wifi everywhere.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2009-07-10 15:01:51 -04:00
Johannes Berg
f21293549f cfg80211: managed mode wext compatibility
This adds code to make it possible to use the cfg80211
connect() API with wireless extensions, and because the
previous patch added emulation of that API with auth()
and assoc(), by extension also supports wext on that.
At the same time, removes code from mac80211 for wext,
but doesn't yet clean up mac80211's mlme code more.

Signed-off-by: Samuel Ortiz <samuel.ortiz@intel.com>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2009-07-10 15:01:51 -04:00
Johannes Berg
6829c878ec cfg80211: emulate connect with auth/assoc
This adds code to cfg80211 so that drivers (mac80211 right
now) that don't implement connect but rather auth/assoc can
still be used with the nl80211 connect command. This will
also be necessary for the wext compat code.

Signed-off-by: Samuel Ortiz <samuel.ortiz@intel.com>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2009-07-10 15:01:51 -04:00
Samuel Ortiz
b23aa676ab cfg80211: connect/disconnect API
This patch introduces the cfg80211 connect/disconnect API.
The goal here is to run the AUTH and ASSOC steps in one call.
This is needed for some fullmac cards that run both steps
directly from the target, after the host driver sends a
connect command.

Additionally, all the new crypto parameters for connect()
are now also valid for associate() -- although associate
requires the IEs to be used, the information can be useful
for drivers and should be given.

Signed-off-by: Samuel Ortiz <samuel.ortiz@intel.com>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2009-07-10 15:01:51 -04:00
Johannes Berg
aff89a9b90 cfg80211: introduce nl80211 testmode command
This introduces a new NL80211_CMD_TESTMODE for testing
and calibration use with nl80211. There's no multiplexing
like like iwpriv had, and the command is not available by
default, it needs to be explicitly enabled in Kconfig and
shouldn't be enabled in most kernels.

The command requires a wiphy index or interface index to
identify the device to operate on, and the new TESTDATA
attribute. There also is API for sending replies to the
command, and testmode multicast messages (on a testmode
multicast group).

I've also updated mac80211 to be able to pass through the
command to the driver, since it itself doesn't implement
the testmode command.

Additionally, to give people an idea of how to use the
command, I've added a little code to hwsim that makes use
of the new command to set the powersave mode, this is
currently done via debugfs and should remain there, and
the testmode command only serves as an example of how to
use this best -- with nested netlink attributes in the
TESTDATA attribute. A hwsim testmode tool can be found at
http://git.sipsolutions.net/hwsim.git/. This tool is BSD
licensed so people can easily use it as a basis for their
own internal fabrication and validation tools.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2009-07-10 15:01:50 -04:00
Johannes Berg
5121ea0481 wext: constify extra argument to wireless_send_event
This is never changed by the function, so can be marked const.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2009-07-10 15:01:49 -04:00
Johannes Berg
7ebbe6bd51 cfg80211: remove wireless_dev->bssid
This variable isn't necessary -- the wext code keeps
track of the BSSID itself, and otherwise we have
current_bss.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2009-07-10 15:01:49 -04:00
Johannes Berg
e6d6e3420d cfg80211: use proper allocation flags
Instead of hardcoding GFP_ATOMIC everywhere, add a
new function parameter that gets the flags from the
caller. Obviously then I need to update all callers
(all of them in mac80211), and it turns out that now
it's ok to use GFP_KERNEL in almost all places.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2009-07-10 15:01:49 -04:00
David Kilroy
f1f74825fe cfg80211: add wrapper function to get wiphy from priv pointer
Signed-off-by: David Kilroy <kilroyd@googlemail.com>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2009-07-10 15:01:42 -04:00
Johannes Berg
f1d58c2521 mac80211: push rx status into skb->cb
Within mac80211, we often need to copy the rx status into
skb->cb. This is wasteful, as drivers could be building it
in there to start with. This patch changes the API so that
drivers are expected to pass the RX status in skb->cb, now
accessible as IEEE80211_SKB_RXCB(skb). It also updates all
drivers to pass the rx status in there, but only by making
them memcpy() it into place before the call to the receive
function (ieee80211_rx(_irqsafe)). Each driver can now be
optimised on its own schedule.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2009-07-10 14:57:54 -04:00
Johannes Berg
e36d56b648 cfg80211: pass netdev to change_virtual_intf
If there was a reason I'm passing the ifidx I cannot
remember it any more and don't see one now, so let's
just pass the pointer itself.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2009-07-10 14:57:38 -04:00
David S. Miller
e5a8a896f5 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 2009-07-09 20:18:24 -07:00
Jiri Olsa
ad46276952 memory barrier: adding smp_mb__after_lock
Adding smp_mb__after_lock define to be used as a smp_mb call after
a lock.

Making it nop for x86, since {read|write|spin}_lock() on x86 are
full memory barriers.

Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-07-09 17:06:58 -07:00
Jiri Olsa
a57de0b433 net: adding memory barrier to the poll and receive callbacks
Adding memory barrier after the poll_wait function, paired with
receive callbacks. Adding fuctions sock_poll_wait and sk_has_sleeper
to wrap the memory barrier.

Without the memory barrier, following race can happen.
The race fires, when following code paths meet, and the tp->rcv_nxt
and __add_wait_queue updates stay in CPU caches.

CPU1                         CPU2

sys_select                   receive packet
  ...                        ...
  __add_wait_queue           update tp->rcv_nxt
  ...                        ...
  tp->rcv_nxt check          sock_def_readable
  ...                        {
  schedule                      ...
                                if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
                                        wake_up_interruptible(sk->sk_sleep)
                                ...
                             }

If there was no cache the code would work ok, since the wait_queue and
rcv_nxt are opposit to each other.

Meaning that once tp->rcv_nxt is updated by CPU2, the CPU1 either already
passed the tp->rcv_nxt check and sleeps, or will get the new value for
tp->rcv_nxt and will return with new data mask.
In both cases the process (CPU1) is being added to the wait queue, so the
waitqueue_active (CPU2) call cannot miss and will wake up CPU1.

The bad case is when the __add_wait_queue changes done by CPU1 stay in its
cache, and so does the tp->rcv_nxt update on CPU2 side.  The CPU1 will then
endup calling schedule and sleep forever if there are no more data on the
socket.

Calls to poll_wait in following modules were ommited:
	net/bluetooth/af_bluetooth.c
	net/irda/af_irda.c
	net/irda/irnet/irnet_ppp.c
	net/mac80211/rc80211_pid_debugfs.c
	net/phonet/socket.c
	net/rds/af_rds.c
	net/rfkill/core.c
	net/sunrpc/cache.c
	net/sunrpc/rpc_pipe.c
	net/tipc/socket.c

Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-07-09 17:06:57 -07:00
Cyrill Gorcunov
e04af024b2 net, netns_xt: shrink netns_xt members
In case if kernel was compiled without ebtables support
there is no need to keep ebt_table pointers in netns_xt
structure.

Make it config dependent.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-07-05 19:16:18 -07:00
Rami Rosen
af794c7424 cleanup: remove unused member in scm_cookie.
This patch removes an unused member (seq) scm_cookie; besides initialized
to 0 in the header file, it is not used.

Signed-off-by: Rami Rosen <ramirose@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-07-05 19:16:10 -07:00
David S. Miller
53bd9728bf Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/kaber/nf-2.6 2009-06-29 19:22:31 -07:00
Patrick McHardy
a3a9f79e36 netfilter: tcp conntrack: fix unacknowledged data detection with NAT
When NAT helpers change the TCP packet size, the highest seen sequence
number needs to be corrected. This is currently only done upwards, when
the packet size is reduced the sequence number is unchanged. This causes
TCP conntrack to falsely detect unacknowledged data and decrease the
timeout.

Fix by updating the highest seen sequence number in both directions after
packet mangling.

Tested-by: Krzysztof Piotr Oledzki <ole@ans.pl>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2009-06-29 14:07:56 +02:00
Rémi Denis-Courmont
c7a1a4c80f Phonet: publicize the Netlink notification function
Signed-off-by: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-06-25 02:58:15 -07:00
Linus Torvalds
09ce42d316 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6:
  bnx2: Fix the behavior of ethtool when ONBOOT=no
  qla3xxx: Don't sleep while holding lock.
  qla3xxx: Give the PHY time to come out of reset.
  ipv4 routing: Ensure that route cache entries are usable and reclaimable with caching is off
  net: Move rx skb_orphan call to where needed
  ipv6: Use correct data types for ICMPv6 type and code
  net: let KS8842 driver depend on HAS_IOMEM
  can: let SJA1000 driver depend on HAS_IOMEM
  netxen: fix firmware init handshake
  netxen: fix build with without CONFIG_PM
  netfilter: xt_rateest: fix comparison with self
  netfilter: xt_quota: fix incomplete initialization
  netfilter: nf_log: fix direct userspace memory access in proc handler
  netfilter: fix some sparse endianess warnings
  netfilter: nf_conntrack: fix conntrack lookup race
  netfilter: nf_conntrack: fix confirmation race condition
  netfilter: nf_conntrack: death_by_timeout() fix
2009-06-24 10:01:12 -07:00
Herbert Xu
d55d87fdff net: Move rx skb_orphan call to where needed
In order to get the tun driver to account packets, we need to be
able to receive packets with destructors set.  To be on the safe
side, I added an skb_orphan call for all protocols by default since
some of them (IP in particular) cannot handle receiving packets
destructors properly.

Now it seems that at least one protocol (CAN) expects to be able
to pass skb->sk through the rx path without getting clobbered.

So this patch attempts to fix this properly by moving the skb_orphan
call to where it's actually needed.  In particular, I've added it
to skb_set_owner_[rw] which is what most users of skb->destructor
call.

This is actually an improvement for tun too since it means that
we only give back the amount charged to the socket when the skb
is passed to another socket that will also be charged accordingly.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Tested-by: Oliver Hartkopp <olver@hartkopp.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-06-23 16:36:25 -07:00
Brian Haley
d5fdd6babc ipv6: Use correct data types for ICMPv6 type and code
Change all the code that deals directly with ICMPv6 type and code
values to use u8 instead of a signed int as that's the actual data
type.

Signed-off-by: Brian Haley <brian.haley@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-06-23 04:31:07 -07:00
Linus Torvalds
5165aece0e Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (43 commits)
  via-velocity: Fix velocity driver unmapping incorrect size.
  mlx4_en: Remove redundant refill code on RX
  mlx4_en: Removed redundant check on lso header size
  mlx4_en: Cancel port_up check in transmit function
  mlx4_en: using stop/start_all_queues
  mlx4_en: Removed redundant skb->len check
  mlx4_en: Counting all the dropped packets on the TX side
  usbnet cdc_subset: fix issues talking to PXA gadgets
  Net: qla3xxx, remove sleeping in atomic
  ipv4: fix NULL pointer + success return in route lookup path
  isdn: clean up documentation index
  cfg80211: validate station settings
  cfg80211: allow setting station parameters in mesh
  cfg80211: allow adding/deleting stations on mesh
  ath5k: fix beacon_int handling
  MAINTAINERS: Fix Atheros pattern paths
  ath9k: restore PS mode, before we put the chip into FULL SLEEP state.
  ath9k: wait for beacon frame along with CAB
  acer-wmi: fix rfkill conversion
  ath5k: avoid PCI FATAL interrupts by restoring RETRY_TIMEOUT disabling
  ...
2009-06-22 11:57:09 -07:00
Hendrik Brueckner
0ea920d211 af_iucv: Return -EAGAIN if iucv msg limit is exceeded
If the iucv message limit for a communication path is exceeded,
sendmsg() returns -EAGAIN instead of -EPIPE.
The calling application can then handle this error situtation,
e.g. to try again after waiting some time.

For blocking sockets, sendmsg() waits up to the socket timeout
before returning -EAGAIN. For the new wait condition, a macro
has been introduced and the iucv_sock_wait_state() has been
refactored to this macro.

Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-06-19 00:10:40 -07:00
Linus Torvalds
d2aa455037 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (55 commits)
  netxen: fix tx ring accounting
  netxen: fix detection of cut-thru firmware mode
  forcedeth: fix dma api mismatches
  atm: sk_wmem_alloc initial value is one
  net: correct off-by-one write allocations reports
  via-velocity : fix no link detection on boot
  Net / e100: Fix suspend of devices that cannot be power managed
  TI DaVinci EMAC : Fix rmmod error
  net: group address list and its count
  ipv4: Fix fib_trie rebalancing, part 2
  pkt_sched: Update drops stats in act_police
  sky2: version 1.23
  sky2: add GRO support
  sky2: skb recycling
  sky2: reduce default transmit ring
  sky2: receive counter update
  sky2: fix shutdown synchronization
  sky2: PCI irq issues
  sky2: more receive shutdown
  sky2: turn off pause during shutdown
  ...

Manually fix trivial conflict in net/core/skbuff.c due to kmemcheck
2009-06-18 14:07:15 -07:00
Eric Dumazet
c564039fd8 net: sk_wmem_alloc has initial value of one, not zero
commit 2b85a34e91
(net: No more expensive sock_hold()/sock_put() on each tx)
changed initial sk_wmem_alloc value.

Some protocols check sk_wmem_alloc value to determine if a timer
must delay socket deallocation. We must take care of the sk_wmem_alloc
value being one instead of zero when no write allocations are pending.

Reported by Ingo Molnar, and full diagnostic from David Miller.

This patch introduces three helpers to get read/write allocations
and a followup patch will use these helpers to report correct
write allocations to user.

Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-06-17 04:31:25 -07:00
Linus Torvalds
b3fec0fe35 Merge branch 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/vegard/kmemcheck
* 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/vegard/kmemcheck: (39 commits)
  signal: fix __send_signal() false positive kmemcheck warning
  fs: fix do_mount_root() false positive kmemcheck warning
  fs: introduce __getname_gfp()
  trace: annotate bitfields in struct ring_buffer_event
  net: annotate struct sock bitfield
  c2port: annotate bitfield for kmemcheck
  net: annotate inet_timewait_sock bitfields
  ieee1394/csr1212: fix false positive kmemcheck report
  ieee1394: annotate bitfield
  net: annotate bitfields in struct inet_sock
  net: use kmemcheck bitfields API for skbuff
  kmemcheck: introduce bitfield API
  kmemcheck: add opcode self-testing at boot
  x86: unify pte_hidden
  x86: make _PAGE_HIDDEN conditional
  kmemcheck: make kconfig accessible for other architectures
  kmemcheck: enable in the x86 Kconfig
  kmemcheck: add hooks for the page allocator
  kmemcheck: add hooks for page- and sg-dma-mappings
  kmemcheck: don't track page tables
  ...
2009-06-16 13:09:51 -07:00
David S. Miller
14ebaf81e1 x25: Fix sleep from timer on socket destroy.
If socket destuction gets delayed to a timer, we try to
lock_sock() from that timer which won't work.

Use bh_lock_sock() in that case.

Signed-off-by: David S. Miller <davem@davemloft.net>
Tested-by: Ingo Molnar <mingo@elte.hu>
2009-06-16 05:40:30 -07:00
Vegard Nossum
a98b65a3ad net: annotate struct sock bitfield
2009/2/24 Ingo Molnar <mingo@elte.hu>:
> ok, this is the last warning i have from today's overnight -tip
> testruns - a 32-bit system warning in sock_init_data():
>
> [    2.610389] NET: Registered protocol family 16
> [    2.616138] initcall netlink_proto_init+0x0/0x170 returned 0 after 7812 usecs
> [    2.620010] WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (f642c184)
> [    2.624002] 010000000200000000000000604990c000000000000000000000000000000000
> [    2.634076]  i i i i i i u u i i i i i i i i i i i i i i i i i i i i i i i i
> [    2.641038]          ^
> [    2.643376]
> [    2.644004] Pid: 1, comm: swapper Not tainted (2.6.29-rc6-tip-01751-g4d1c22c-dirty #885)
> [    2.648003] EIP: 0060:[<c07141a1>] EFLAGS: 00010282 CPU: 0
> [    2.652008] EIP is at sock_init_data+0xa1/0x190
> [    2.656003] EAX: 0001a800 EBX: f6836c00 ECX: 00463000 EDX: c0e46fe0
> [    2.660003] ESI: f642c180 EDI: c0b83088 EBP: f6863ed8 ESP: c0c412ec
> [    2.664003]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
> [    2.668003] CR0: 8005003b CR2: f682c400 CR3: 00b91000 CR4: 000006f0
> [    2.672003] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
> [    2.676003] DR6: ffff4ff0 DR7: 00000400
> [    2.680002]  [<c07423e5>] __netlink_create+0x35/0xa0
> [    2.684002]  [<c07443cc>] netlink_kernel_create+0x4c/0x140
> [    2.688002]  [<c072755e>] rtnetlink_net_init+0x1e/0x40
> [    2.696002]  [<c071b601>] register_pernet_operations+0x11/0x30
> [    2.700002]  [<c071b72c>] register_pernet_subsys+0x1c/0x30
> [    2.704002]  [<c0bf3c8c>] rtnetlink_init+0x4c/0x100
> [    2.708002]  [<c0bf4669>] netlink_proto_init+0x159/0x170
> [    2.712002]  [<c0101124>] do_one_initcall+0x24/0x150
> [    2.716002]  [<c0bbf3c7>] do_initcalls+0x27/0x40
> [    2.723201]  [<c0bbf3fc>] do_basic_setup+0x1c/0x20
> [    2.728002]  [<c0bbfb8a>] kernel_init+0x5a/0xa0
> [    2.732002]  [<c0103e47>] kernel_thread_helper+0x7/0x10
> [    2.736002]  [<ffffffff>] 0xffffffff

We fix this false positive by annotating the bitfield in struct
sock.

Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
2009-06-15 15:49:36 +02:00
Vegard Nossum
9e337b0fb3 net: annotate inet_timewait_sock bitfields
The use of bitfields here would lead to false positive warnings with
kmemcheck. Silence them.

(Additionally, one erroneous comment related to the bitfield was also
fixed.)

Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
2009-06-15 15:49:32 +02:00
Vegard Nossum
45e3ff8270 net: annotate bitfields in struct inet_sock
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
2009-06-15 15:49:27 +02:00
Jarek Poplawski
ca44d6e60f pkt_sched: Rename PSCHED_US2NS and PSCHED_NS2US
Let's use TICKS instead of US, so PSCHED_TICKS2NS and PSCHED_NS2TICKS
(like in PSCHED_TICKS_PER_SEC already) to avoid misleading.

Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-06-15 02:31:47 -07:00
Pablo Neira Ayuso
dd7669a92c netfilter: conntrack: optional reliable conntrack event delivery
This patch improves ctnetlink event reliability if one broadcast
listener has set the NETLINK_BROADCAST_ERROR socket option.

The logic is the following: if an event delivery fails, we keep
the undelivered events in the missed event cache. Once the next
packet arrives, we add the new events (if any) to the missed
events in the cache and we try a new delivery, and so on. Thus,
if ctnetlink fails to deliver an event, we try to deliver them
once we see a new packet. Therefore, we may lose state
transitions but the userspace process gets in sync at some point.

At worst case, if no events were delivered to userspace, we make
sure that destroy events are successfully delivered. Basically,
if ctnetlink fails to deliver the destroy event, we remove the
conntrack entry from the hashes and we insert them in the dying
list, which contains inactive entries. Then, the conntrack timer
is added with an extra grace timeout of random32() % 15 seconds
to trigger the event again (this grace timeout is tunable via
/proc). The use of a limited random timeout value allows
distributing the "destroy" resends, thus, avoiding accumulating
lots "destroy" events at the same time. Event delivery may
re-order but we can identify them by means of the tuple plus
the conntrack ID.

The maximum number of conntrack entries (active or inactive) is
still handled by nf_conntrack_max. Thus, we may start dropping
packets at some point if we accumulate a lot of inactive conntrack
entries that did not successfully report the destroy event to
userspace.

During my stress tests consisting of setting a very small buffer
of 2048 bytes for conntrackd and the NETLINK_BROADCAST_ERROR socket
flag, and generating lots of very small connections, I noticed
very few destroy entries on the fly waiting to be resend.

A simple way to test this patch consist of creating a lot of
entries, set a very small Netlink buffer in conntrackd (+ a patch
which is not in the git tree to set the BROADCAST_ERROR flag)
and invoke `conntrack -F'.

For expectations, no changes are introduced in this patch.
Currently, event delivery is only done for new expectations (no
events from expectation expiration, removal and confirmation).
In that case, they need a per-expectation event cache to implement
the same idea that is exposed in this patch.

This patch can be useful to provide reliable flow-accouting. We
still have to add a new conntrack extension to store the creation
and destroy time.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2009-06-13 12:30:52 +02:00
Pablo Neira Ayuso
9858a3ae1d netfilter: conntrack: move helper destruction to nf_ct_helper_destroy()
This patch moves the helper destruction to a function that lives
in nf_conntrack_helper.c. This new function is used in the patch
to add ctnetlink reliable event delivery.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2009-06-13 12:28:22 +02:00
Pablo Neira Ayuso
a0891aa6a6 netfilter: conntrack: move event caching to conntrack extension infrastructure
This patch reworks the per-cpu event caching to use the conntrack
extension infrastructure.

The main drawback is that we consume more memory per conntrack
if event delivery is enabled. This patch is required by the
reliable event delivery that follows to this patch.

BTW, this patch allows you to enable/disable event delivery via
/proc/sys/net/netfilter/nf_conntrack_events in runtime, although
you can still disable event caching as compilation option.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2009-06-13 12:26:29 +02:00
Patrick McHardy
36432dae73 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6 2009-06-11 16:00:49 +02:00
David S. Miller
bb400801c2 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/holtmann/bluetooth-next-2.6 2009-06-11 05:47:43 -07:00
Eric Dumazet
2b85a34e91 net: No more expensive sock_hold()/sock_put() on each tx
One of the problem with sock memory accounting is it uses
a pair of sock_hold()/sock_put() for each transmitted packet.

This slows down bidirectional flows because the receive path
also needs to take a refcount on socket and might use a different
cpu than transmit path or transmit completion path. So these
two atomic operations also trigger cache line bounces.

We can see this in tx or tx/rx workloads (media gateways for example),
where sock_wfree() can be in top five functions in profiles.

We use this sock_hold()/sock_put() so that sock freeing
is delayed until all tx packets are completed.

As we also update sk_wmem_alloc, we could offset sk_wmem_alloc
by one unit at init time, until sk_free() is called.
Once sk_free() is called, we atomic_dec_and_test(sk_wmem_alloc)
to decrement initial offset and atomicaly check if any packets
are in flight.

skb_set_owner_w() doesnt call sock_hold() anymore

sock_wfree() doesnt call sock_put() anymore, but check if sk_wmem_alloc
reached 0 to perform the final freeing.

Drawback is that a skb->truesize error could lead to unfreeable sockets, or
even worse, prematurely calling __sk_free() on a live socket.

Nice speedups on SMP. tbench for example, going from 2691 MB/s to 2711 MB/s
on my 8 cpu dev machine, even if tbench was not really hitting sk_refcnt
contention point. 5 % speedup on a UDP transmit workload (depends
on number of flows), lowering TX completion cpu usage.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-06-11 02:55:43 -07:00
Johannes Berg
8f77f3849c mac80211: do not pass PS frames out of mac80211 again
In order to handle powersave frames properly we had needed
to pass these out to the device queues again, and introduce
the skb->requeue bit. This, however, also has unnecessary
overhead by needing to 'clean up' already tried frames, and
this clean-up code is also buggy when software encryption
is used.

Instead of sending the frames via the master netdev queue
again, simply put them into the pending queue. This also
fixes a problem where frames for that particular station
could be reordered when some were still on the software
queues and older ones are re-injected into the software
queue after them.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2009-06-10 13:28:37 -04:00
Patrick McHardy
440f0d5885 netfilter: nf_conntrack: use per-conntrack locks for protocol data
Introduce per-conntrack locks and use them instead of the global protocol
locks to avoid contention. Especially tcp_lock shows up very high in
profiles on larger machines.

This will also allow to simplify the upcoming reliable event delivery patches.

Signed-off-by: Patrick McHardy <kaber@trash.net>
2009-06-10 14:32:47 +02:00