Introduce sk_tx_queue_mapping; and functions that set, test and
get this value. Reset sk_tx_queue_mapping to -1 whenever the dst
cache is set/reset, and in socket alloc. Setting txq to -1 and
using valid txq=<0 to n-1> allows the tx path to use the value
of sk_tx_queue_mapping directly instead of subtracting 1 on every
tx.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 9e337b0f (net: annotate inet_timewait_sock bitfields)
added 4/8 bytes in struct inet_timewait_sock.
Fix this by declaring tw_ipv6_offset in the 'flags' bitfield
The 14 bits hole is named tw_pad to make it cleary apparent.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
It can help being able to filter packets on their queue_mapping.
If filter performance is not good, we could add a "numqueue" field
in struct packet_type, so that netif_nit_deliver() and other functions
can directly ignore packets with not expected queue number.
Lets experiment this simple filter extension first.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch makes the private functions alloc_can_skb() and
alloc_can_err_skb() of the at91_can driver public and adapts all
drivers to use these. While making the patch I realized, that
the skb's are *not* setup consistently. It's now done as shown
below:
skb->protocol = htons(ETH_P_CAN);
skb->pkt_type = PACKET_BROADCAST;
skb->ip_summed = CHECKSUM_UNNECESSARY;
*cf = (struct can_frame *)skb_put(skb, sizeof(struct can_frame));
memset(*cf, 0, sizeof(struct can_frame));
The frame is zeroed out to avoid uninitialized data to be passed to
user space. Some drivers or library code did not set "pkt_type" or
"ip_summed". Also, "__constant_htons()" should not be used for
runtime invocations, as pointed out by David Miller.
Signed-off-by: Wolfgang Grandegger <wg@grandegger.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow bpf to set a filter to drop packets that dont
match a specific mark
Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
The i2400m driver uses two different bits to distinguish how much the
driver is up. i2400m->ready is used to denote that the infrastructure
to communicate with the device is up and running. i2400m->updown is
used to indicate if 'ready' and the device is up and running, ready to
take control and data traffic.
However, all this was pretty dirty and not clear, with many open spots
where race conditions were present.
This commit cleans up the situation by:
- documenting the usage of both bits
- setting them only in specific, well controlled places
(i2400m_dev_start, i2400m_dev_stop)
- ensuring the i2400m workqueue can't get in the middle of the
setting by flushing it when i2400m->ready is set to zero. This
allows the report hook not having to check again for the bit to be
set [rx.c:i2400m_report_hook_work()].
- using i2400m->updown to determine if the device is up and running
instead of the wimax state in i2400m_dev_reset_handle().
- not loosing missed messages sent by the hardware before
i2400m->ready is set. In rx.c, whatever the device sends can be
sent to user space over the message pipes as soon as the wimax
device is registered, so don't wait for i2400m->ready to be set.
Signed-off-by: Inaky Perez-Gonzalez <inaky@linux.intel.com>
Add support for the WiMAX device in the Intel WiFi/WiMAX Link 6050
Series; this involves:
- adding the device ID to bind to and an endpoint mapping for the
driver to use.
- at probe() time, some things are set depending on the device id:
+ the list of firmware names to try
+ mapping of endpoints
Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Signed-off-by: Inaky Perez-Gonzalez <inaky@linux.intel.com>
Different sdio device IDs are designated to support different intel
wimax silicon sku. The new macro SDIO_DEVICE_ID_IWMC3200_WIMAX_2G5(0x1407)
is added to support iwmc3200 2.5GHz sku. The existing
SDIO_DEVICE_ID_IWMC3200_WIMAX(0x1402) is for iwmc3200 general sku.
Signed-off-by: Cindy H Kao <cindy.h.kao@intel.com>
Signed-off-by: Inaky Perez-Gonzalez <inaky@linux.intel.com>
The i2400m based devices can get in a sort of a deadlock some times;
when they boot, they send a reboot "barker" (a magic number) and then
the driver has to echo that same barker to ack reception
(echo/ack). Then the device does a final ack by sending an ACK barker.
The first time this happens, we don't know ahead of time with barker
the device is going to send, as different device models and SKUs will
send different barker depending on the EEPROM programming.
If the device has sent the barker before the driver has been able to
read it, the driver looses, as it doesn't know which barker it has to
echo/ack back. With older devices, we tried a couple of combinations
and that always worked; but now, with adding support for more, in
which we have an unlimited number of new barkers, that is not an
option.
So we rework said case so that when the device gets stuck, we just
cycle through all the known types until one forces the device to send
an ack. Otherwise, the driver gives up and aborts.
Signed-off-by: Inaky Perez-Gonzalez <inaky@linux.intel.com>
The i2400m based devices can boot two main types of firmware images:
signed and non-signed. Signed images have signature data included that
must match that of a certificate stored in the device.
Currently the code is making the decission on what type of firmware
load (signed vs non-signed) is going to be loaded based on a hardcoded
decission in __i2400m_ack_verify(), based on the barker the device
sent upon boot.
This is not flexible enough as future hardware will emit more barkers;
thus the bit has to be set in a place where there is better knowledge
of what is going on. This will be done in follow-up commits -- however
this patch paves the way for it.
So the querying of the mode is packed into i2400m_boot_is_signed();
the main changes are just using i2400m_boot_is_signed() to determine
the method to follow and setting i2400m->sboot in
i2400m_is_boot_barker(). The modifications in i2400m_dnload_init() and
i2400m_dnload_finalize() are just reorganizing the order of the if
blocks and thus look larger than they really are.
Signed-off-by: Inaky Perez-Gonzalez <inaky@linux.intel.com>
Add "debug" module options to all the wimax modules (including
drivers) so that the debug levels can be set upon kernel boot or
module load time.
This is needed as currently there was a limitation where the debug
levels could only be set when a device was succesfully
enumerated. This made it difficult to debug issues that made a device
not probe properly.
Signed-off-by: Inaky Perez-Gonzalez <inaky@linux.intel.com>
The last users of skb_icv_walk are converted to ahash now,
so skb_icv_walk is unused and can be removed.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ah4 and ah6 are converted to ahash now, so we can remove the
code for the obsolete hash algorithm.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To support for ahash algorithms, we add a pointer to a
crypto_ahash to ah_data.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to have better cache layouts of struct sock (separate zones
for rx/tx paths), we need this preliminary patch.
Goal is to transfert fields used at lookup time in the first
read-mostly cache line (inside struct sock_common) and move sk_refcnt
to a separate cache line (only written by rx path)
This patch adds inet_ prefix to daddr, rcv_saddr, dport, num, saddr,
sport and id fields. This allows a future patch to define these
fields as macros, like sk_refcnt, without name clashes.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The Phonet "universe" only has 64 addresses, so we keep a trivial flat
routing table.
Signed-off-by: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Extend the driver to accept a MAC address specified in platform_data.
Signed-off-by: Manuel Lauss <manuel.lauss@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch allows the CAN controller driver to define the number of echo
skb's used for the local loopback (echo), as suggested by Kurt Van
Dijck, with the function:
struct net_device *alloc_candev(int sizeof_priv,
unsigned int echo_skb_max);
The CAN drivers have been adapted accordingly. For the ems_usb driver,
as suggested by Sebastian Haas, the number of echo skb's has been
increased to 10, which improves the transmission performance a lot.
Signed-off-by: Wolfgang Grandegger <wg@grandegger.com>
Signed-off-by: Kurt Van Dijck <kurt.van.dijck@eia.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of hardcoding NET_IP_ALIGN stuff in various network drivers,
we can add a helper around netdev_alloc_skb()
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Storing the mask (size - 1) instead of the size allows fast path to be
a bit faster.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Meaning receive multiple messages, reducing the number of syscalls and
net stack entry/exit operations.
Next patches will introduce mechanisms where protocols that want to
optimize this operation will provide an unlocked_recvmsg operation.
This takes into account comments made by:
. Paul Moore: sock_recvmsg is called only for the first datagram,
sock_recvmsg_nosec is used for the rest.
. Caitlin Bestler: recvmmsg now has a struct timespec timeout, that
works in the same fashion as the ppoll one.
If the underlying protocol returns a datagram with MSG_OOB set, this
will make recvmmsg return right away with as many datagrams (+ the OOB
one) it has received so far.
. Rémi Denis-Courmont & Steven Whitehouse: If we receive N < vlen
datagrams and then recvmsg returns an error, recvmmsg will return
the successfully received datagrams, store the error and return it
in the next call.
This paves the way for a subsequent optimization, sk_prot->unlocked_recvmsg,
where we will be able to acquire the lock only at batch start and end, not at
every underlying recvmsg call.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Create a new socket level option to report number of queue overflows
Recently I augmented the AF_PACKET protocol to report the number of frames lost
on the socket receive queue between any two enqueued frames. This value was
exported via a SOL_PACKET level cmsg. AFter I completed that work it was
requested that this feature be generalized so that any datagram oriented socket
could make use of this option. As such I've created this patch, It creates a
new SOL_SOCKET level option called SO_RXQ_OVFL, which when enabled exports a
SOL_SOCKET level cmsg that reports the nubmer of times the sk_receive_queue
overflowed between any two given frames. It also augments the AF_PACKET
protocol to take advantage of this new feature (as it previously did not touch
sk->sk_drops, which this patch uses to record the overflow count). Tested
successfully by me.
Notes:
1) Unlike my previous patch, this patch simply records the sk_drops value, which
is not a number of drops between packets, but rather a total number of drops.
Deltas must be computed in user space.
2) While this patch currently works with datagram oriented protocols, it will
also be accepted by non-datagram oriented protocols. I'm not sure if thats
agreeable to everyone, but my argument in favor of doing so is that, for those
protocols which aren't applicable to this option, sk_drops will always be zero,
and reporting no drops on a receive queue that isn't used for those
non-participating protocols seems reasonable to me. This also saves us having
to code in a per-protocol opt in mechanism.
3) This applies cleanly to net-next assuming that commit
977750076d (my af packet cmsg patch) is reverted
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ieee80211_rx() must be called with softirqs disabled
since the networking stack requires this for netif_rx()
and some code in mac80211 can assume that it can not
be processing its own tasklet and this call at the same
time.
It may be possible to remove this requirement after a
careful audit of mac80211 and doing any needed locking
improvements in it along with disabling softirqs around
netif_rx(). An alternative might be to push all packet
processing to process context in mac80211, instead of
to the tasklet, and add other synchronisation.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Since commit a98b65a3 (net: annotate struct sock bitfield), we lost
8 bytes in struct sock on 64bit arches because of
kmemcheck_bitfield_end(flags) misplacement.
Fix this by putting together sk_shutdown, sk_no_check, sk_userlocks,
sk_protocol and sk_type in the 'flags' 32bits bitfield
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(This patch fixes bug of commit f7734fdf61
title "make TLLAO option for NA packets configurable")
When the IPV6 conf is used, the function sysctl_set_parent is called and the
array addrconf_sysctl is used as a parameter of the function.
The above patch added new conf "force_tllao" into the array addrconf_sysctl,
but the size of the array was not modified, the static allocated size is
DEVCONF_MAX + 1 but the real size is DEVCONF_MAX + 2, so the problem is
that the function sysctl_set_parent accessed wrong address.
I got the following information.
Call Trace:
[<ffffffff8106085d>] sysctl_set_parent+0x29/0x3e
[<ffffffff8106085d>] sysctl_set_parent+0x29/0x3e
[<ffffffff8106085d>] sysctl_set_parent+0x29/0x3e
[<ffffffff8106085d>] sysctl_set_parent+0x29/0x3e
[<ffffffff8106085d>] sysctl_set_parent+0x29/0x3e
[<ffffffff810622d5>] __register_sysctl_paths+0xde/0x272
[<ffffffff8110892d>] ? __kmalloc_track_caller+0x16e/0x180
[<ffffffffa00cfac3>] ? __addrconf_sysctl_register+0xc5/0x144 [ipv6]
[<ffffffff8141f2c9>] register_net_sysctl_table+0x48/0x4b
[<ffffffffa00cfaf5>] __addrconf_sysctl_register+0xf7/0x144 [ipv6]
[<ffffffffa00cfc16>] addrconf_init_net+0xd4/0x104 [ipv6]
[<ffffffff8139195f>] setup_net+0x35/0x82
[<ffffffff81391f6c>] copy_net_ns+0x76/0xe0
[<ffffffff8107ad60>] create_new_namespaces+0xf0/0x16e
[<ffffffff8107afee>] copy_namespaces+0x65/0x9f
[<ffffffff81056dff>] copy_process+0xb2c/0x12c3
[<ffffffff810576e1>] do_fork+0x14b/0x2d2
[<ffffffff8107ac4e>] ? up_read+0xe/0x10
[<ffffffff81438e73>] ? do_page_fault+0x27a/0x2aa
[<ffffffff8101044b>] sys_clone+0x28/0x2a
[<ffffffff81011fb3>] stub_clone+0x13/0x20
[<ffffffff81011c72>] ? system_call_fastpath+0x16/0x1b
And the information of IPV6 in .config is as following.
IPV6 in .config:
CONFIG_IPV6=m
CONFIG_IPV6_PRIVACY=y
CONFIG_IPV6_ROUTER_PREF=y
CONFIG_IPV6_ROUTE_INFO=y
CONFIG_IPV6_OPTIMISTIC_DAD=y
CONFIG_IPV6_MIP6=m
CONFIG_IPV6_SIT=m
# CONFIG_IPV6_SIT_6RD is not set
CONFIG_IPV6_NDISC_NODETYPE=y
CONFIG_IPV6_TUNNEL=m
CONFIG_IPV6_MULTIPLE_TABLES=y
CONFIG_IPV6_SUBTREES=y
CONFIG_IPV6_MROUTE=y
CONFIG_IPV6_PIMSM_V2=y
# CONFIG_IP_VS_IPV6 is not set
CONFIG_NF_CONNTRACK_IPV6=m
CONFIG_IP6_NF_MATCH_IPV6HEADER=m
I confirmed this patch fixes this problem.
Signed-off-by: Jin Dongming <jin.dongming@np.css.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TI HECC (High End CAN Controller) module is found on many TI devices. It
has 32 hardware mailboxes with full implementation of CAN protocol 2.0B
with bus speeds up to 1Mbps. Specifications of the module are available
on TI web <http://www.ti.com>
Signed-off-by: Anant Gole <anantgole@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
UDP_HTABLE_SIZE was initialy defined to 128, which is a bit small for
several setups.
4000 active UDP sockets -> 32 sockets per chain in average. An
incoming frame has to lookup all sockets to find best match, so long
chains hurt latency.
Instead of a fixed size hash table that cant be perfect for every
needs, let UDP stack choose its table size at boot time like tcp/ip
route, using alloc_large_system_hash() helper
Add an optional boot parameter, uhash_entries=x so that an admin can
force a size between 256 and 65536 if needed, like thash_entries and
rhash_entries.
dmesg logs two new lines :
[ 0.647039] UDP hash table entries: 512 (order: 0, 4096 bytes)
[ 0.647099] UDP Lite hash table entries: 512 (order: 0, 4096 bytes)
Maximal size on 64bit arches would be 65536 slots, ie 1 MBytes for non
debugging spinlocks.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix build error introduced in commit fa857afcf - ipv6 sit: 6rd
(IPv6 Rapid Deployment) Support. Struct in6_addr is the issue.
I'm only seeing this on x86_64 systems, not on 32-bit with same
IPv6 config options, so it could be there's a missing forward
declaration somewhere, but including the correct header file
fixes the problem too.
CC [M] net/ipv6/ip6_tunnel.o
In file included from net/ipv6/ip6_tunnel.c:31:
include/linux/if_tunnel.h:59: error: field ‘prefix’ has incomplete type
make[2]: *** [net/ipv6/ip6_tunnel.o] Error 1
make[1]: *** [net/ipv6] Error 2
Signed-off-by: Brian Haley <brian.haley@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It's useful to provide firmware and hardware version to user space and have a
generic interface to retrieve them. Users can provide the version information
in bug reports etc.
Add fields for firmware and hardware version to struct wiphy.
(Dropped nl80211 bits for now and modified remaining bits in favor of
ethtool. -- JWL)
Cc: Kalle Valo <kalle.valo@nokia.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Refactor wext to
* split out iwpriv handling
* split out iwspy handling
* split out procfs support
* allow cfg80211 to have wireless extensions compat code
w/o CONFIG_WIRELESS_EXT
After this, drivers need to
- select WIRELESS_EXT - for wext support
- select WEXT_PRIV - for iwpriv support
- select WEXT_SPY - for iwspy support
except cfg80211 -- which gets new hooks in wext-core.c
and can then get wext handlers without CONFIG_WIRELESS_EXT.
Wireless extensions procfs support is auto-selected
based on PROC_FS and anything that requires the wext core
(i.e. WIRELESS_EXT or CFG80211_WEXT).
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Linux keeps scan results up to 15 seconds. This can be a problem for fast
moving clients: they get back stale data. But if the kernel reports the age
of the BSS items, then user-space can simply weed out old entries by itself.
Signed-off-by: Holger Schurig <hs4233@mail.mn-solutions.de>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
The data center bridging ops structure can be const
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
All usages of structure net_proto_ops should be declared const.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On Friday 02 October 2009 20:53:51 you wrote:
> This is good although I would have shortened the name.
Ah, I knew I forgot something :) Here is v4.
tavi
>From 24d96d825b9fa832b22878cc6c990d5711968734 Mon Sep 17 00:00:00 2001
From: Octavian Purdila <opurdila@ixiacom.com>
Date: Fri, 2 Oct 2009 00:51:15 +0300
Subject: [PATCH] ipv6: new sysctl for sending TLLAO with unicast NAs
Neighbor advertisements responding to unicast neighbor solicitations
did not include the target link-layer address option. This patch adds
a new sysctl option (disabled by default) which controls whether this
option should be sent even with unicast NAs.
The need for this arose because certain routers expect the TLLAO in
some situations even as a response to unicast NS packets.
Moreover, RFC 2461 recommends sending this to avoid a race condition
(section 4.4, Target link-layer address)
Signed-off-by: Cosmin Ratiu <cratiu@ixiacom.com>
Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After updating firmware stored in flash, users may wish to reset the
relevant hardware and start the new firmware immediately. This should
not be completely automatic as it may be disruptive.
A selective reset may also be useful for debugging or diagnostics.
This adds a separate reset operation which takes flags indicating the
components to be reset. Drivers are allowed to reset only a subset of
those requested, and must indicate the actual subset. This allows the
use of generic component masks and some future expansion.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jarek Poplawski a écrit :
>
>
> Hmm... So you made me to do some "real" work here, and guess what?:
> there is one serious checkpatch warning! ;-) Plus, this new parameter
> should be added to the function description. Otherwise:
> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
>
> Thanks,
> Jarek P.
>
> PS: I guess full "Don't" would show we really mean it...
Okay :) Here is the last round, before the night !
Thanks again
[RFC] pkt_sched: gen_estimator: Don't report fake rate estimators
We currently send TCA_STATS_RATE_EST elements to netlink users, even if no estimator
is running.
# tc -s -d qdisc
qdisc pfifo_fast 0: dev eth0 root bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
Sent 112833764978 bytes 1495081739 pkt (dropped 0, overlimits 0 requeues 0)
rate 0bit 0pps backlog 0b 0p requeues 0
User has no way to tell if the "rate 0bit 0pps" is a real estimation, or a fake
one (because no estimator is active)
After this patch, tc command output is :
$ tc -s -d qdisc
qdisc pfifo_fast 0: dev eth0 root bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
Sent 561075 bytes 1196 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
We add a parameter to gnet_stats_copy_rate_est() function so that
it can use gen_estimator_active(bstats, r), as suggested by Jarek.
This parameter can be NULL if check is not necessary, (htb for
example has a mandatory rate estimator)
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPv6 Rapid Deployment (6rd; draft-ietf-softwire-ipv6-6rd) builds upon
mechanisms of 6to4 (RFC3056) to enable a service provider to rapidly
deploy IPv6 unicast service to IPv4 sites to which it provides
customer premise equipment. Like 6to4, it utilizes stateless IPv6 in
IPv4 encapsulation in order to transit IPv4-only network
infrastructure. Unlike 6to4, a 6rd service provider uses an IPv6
prefix of its own in place of the fixed 6to4 prefix.
With this option enabled, the SIT driver offers 6rd functionality by
providing additional ioctl API to configure the IPv6 Prefix for in
stead of static 2002::/16 for 6to4.
Original patch was done by Alexandre Cassen <acassen@freebox.fr>
based on old Internet-Draft.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When routing daemon wants to enable forwarding of multicast traffic it
performs something like:
struct vifctl vc = {
.vifc_vifi = 1,
.vifc_flags = 0,
.vifc_threshold = 1,
.vifc_rate_limit = 0,
.vifc_lcl_addr = ip, /* <--- ip address of physical
interface, e.g. eth0 */
.vifc_rmt_addr.s_addr = htonl(INADDR_ANY),
};
setsockopt(fd, IPPROTO_IP, MRT_ADD_VIF, &vc, sizeof(vc));
This leads (in the kernel) to calling vif_add() function call which
search the (physical) device using assigned IP address:
dev = ip_dev_find(net, vifc->vifc_lcl_addr.s_addr);
The current API (struct vifctl) does not allow to specify an
interface other way than using it's IP, and if there are more than a
single interface with specified IP only the first one will be found.
The attached patch (against 2.6.30.4) allows to specify an interface
by its index, instead of IP address:
struct vifctl vc = {
.vifc_vifi = 1,
.vifc_flags = VIFF_USE_IFINDEX, /* NEW */
.vifc_threshold = 1,
.vifc_rate_limit = 0,
.vifc_lcl_ifindex = if_nametoindex("eth0"), /* NEW */
.vifc_rmt_addr.s_addr = htonl(INADDR_ANY),
};
setsockopt(fd, IPPROTO_IP, MRT_ADD_VIF, &vc, sizeof(vc));
Signed-off-by: Ilia K. <mail4ilia@gmail.com>
=== modified file 'include/linux/mroute.h'
Signed-off-by: David S. Miller <davem@davemloft.net>
An incoming datagram must bring into cpu cache *lot* of cache lines,
in particular : (other parts omitted (hash chains, ip route cache...))
On 32bit arches :
offsetof(struct sock, sk_rcvbuf) =0x30 (read)
offsetof(struct sock, sk_lock) =0x34 (rw)
offsetof(struct sock, sk_sleep) =0x50 (read)
offsetof(struct sock, sk_rmem_alloc) =0x64 (rw)
offsetof(struct sock, sk_receive_queue)=0x74 (rw)
offsetof(struct sock, sk_forward_alloc)=0x98 (rw)
offsetof(struct sock, sk_callback_lock)=0xcc (rw)
offsetof(struct sock, sk_drops) =0xd8 (read if we add dropcount support, rw if frame dropped)
offsetof(struct sock, sk_filter) =0xf8 (read)
offsetof(struct sock, sk_socket) =0x138 (read)
offsetof(struct sock, sk_data_ready) =0x15c (read)
We can avoid sk->sk_socket and socket->fasync_list referencing on sockets
with no fasync() structures. (socket->fasync_list ptr is probably already in cache
because it shares a cache line with socket->wait, ie location pointed by sk->sk_sleep)
This avoids one cache line load per incoming packet for common cases (no fasync())
We can leave (or even move in a future patch) sk->sk_socket in a cold location
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For various purposes including a wireless extensions
bugfix, we need to hook into the netdev creation before
before netdev_register_kobject(). This will also ease
doing the dev type assignment that Marcel was working
on for cfg80211 drivers w/o touching them all.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for usbnet based devices like CDC-Ether to indicate that they
are actually mobile broadband devices. In that case use wwan%d as default
interface name.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The following user-space program fails to compile:
#include <linux/socket.h>
#include <sys/socket.h>
int main() { return 0; }
The reason is that <linux/socket.h> tests __GLIBC__ to decide whether it
should define various structures and macros that are now defined for
user-space by <sys/socket.h>, but __GLIBC__ is not defined if no libc
headers have yet been included.
It seems safe to drop support for libc 5 now.
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Bastian Blank <waldi@debian.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
We currently dirty a cache line to update tunnel device stats
(tx_packets/tx_bytes). We better use the txq->tx_bytes/tx_packets
counters that already are present in cpu cache, in the cache
line shared with txq->_xmit_lock
This patch extends IPTUNNEL_XMIT() macro to use txq pointer
provided by the caller.
Also &tunnel->dev->stats can be replaced by &dev->stats
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The FIB algorithim for IPV4 is set at compile time, but kernel goes through
the overhead of function call indirection at runtime. Save some
cycles by turning the indirect calls to direct calls to either
hash or trie code.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add Ancilliary data to better represent loss information
I've had a few requests recently to provide more detail regarding frame loss
during an AF_PACKET packet capture session. Specifically the requestors want to
see where in a packet sequence frames were lost, i.e. they want to see that 40
frames were lost between frames 302 and 303 in a packet capture file. In order
to do this we need:
1) The kernel to export this data to user space
2) The applications to make use of it
This patch addresses item (1). It does this by doing the following:
A) Anytime we drop a frame for which we would increment po->stats.tp_drops, we
also no increment a stats called po->stats.tp_gap.
B) Every time we successfully enqueue a frame to sk_receive_queue, we record the
value of po->stats.tp_gap in skb->mark. skb->cb would nominally be the place to
record this, but since all the space there is used up, we're overloading
skb->mark. Its safe to do since any enqueued packet is guaranteed to be
unshared at this point, and skb->mark isn't used for anything else in the rx
path to the application. After we record tp_gap in the skb, we zero
po->stats.tp_gap. This allows us to keep a counter of the number of frames lost
between any two enqueued packets
C) When the application goes to dequeue a frame from the packet socket, we look
at skb->mark for that frame. If it is non-zero, we add a cmsg chunk to the
msghdr of level SOL_PACKET and type PACKET_GAPDATA. Its a 32 bit integer that
represents the number of frames lost between this packet and the last previous
frame received.
Note there is a chance that if there is frame loss after a receive, and then the
socket is closed, some gap data might be lost. This is covered by the use of
the PACKET_AUXDATA socket option, which gives total loss data. With a bit of
math, the final gap can be determined that way.
I've tested this patch myself, and it works well.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
include/linux/if_packet.h | 2 ++
net/packet/af_packet.c | 33 +++++++++++++++++++++++++++++++++
2 files changed, 35 insertions(+)
Signed-off-by: David S. Miller <davem@davemloft.net>
The in-tree implementations have all been converted to
get_sset_count().
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* 'for-linus' of git://git.kernel.dk/linux-2.6-block: (41 commits)
Revert "Seperate read and write statistics of in_flight requests"
cfq-iosched: don't delay async queue if it hasn't dispatched at all
block: Topology ioctls
cfq-iosched: use assigned slice sync value, not default
cfq-iosched: rename 'desktop' sysfs entry to 'low_latency'
cfq-iosched: implement slower async initiate and queue ramp up
cfq-iosched: delay async IO dispatch, if sync IO was just done
cfq-iosched: add a knob for desktop interactiveness
Add a tracepoint for block request remapping
block: allow large discard requests
block: use normal I/O path for discard requests
swapfile: avoid NULL pointer dereference in swapon when s_bdev is NULL
fs/bio.c: move EXPORT* macros to line after function
Add missing blk_trace_remove_sysfs to be in pair with blk_trace_init_sysfs
cciss: fix build when !PROC_FS
block: Do not clamp max_hw_sectors for stacking devices
block: Set max_sectors correctly for stacking devices
cciss: cciss_host_attr_groups should be const
cciss: Dynamically allocate the drive_info_struct for each logical drive.
cciss: Add usage_count attribute to each logical drive in /sys
...
Not all users of the topology information want to use libblkid. Provide
the topology information through bdev ioctls.
Also clarify sector size comments for existing BLK ioctls.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This slowly ramps up the async queue depth based on the time
passed since the sync IO, and doesn't allow async at all until
a sync slice period has passed.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (46 commits)
cnic: Fix NETDEV_UP event processing.
uvesafb/connector: Disallow unpliviged users to send netlink packets
pohmelfs/connector: Disallow unpliviged users to configure pohmelfs
dst/connector: Disallow unpliviged users to configure dst
dm/connector: Only process connector packages from privileged processes
connector: Removed the destruct_data callback since it is always kfree_skb()
connector/dm: Fixed a compilation warning
connector: Provide the sender's credentials to the callback
connector: Keep the skb in cn_callback_data
e1000e/igb/ixgbe: Don't report an error if devices don't support AER
net: Fix wrong sizeof
net: splice() from tcp to pipe should take into account O_NONBLOCK
net: Use sk_mark for routing lookup in more places
sky2: irqname based on pci address
skge: use unique IRQ name
IPv4 TCP fails to send window scale option when window scale is zero
net/ipv4/tcp.c: fix min() type mismatch warning
Kconfig: STRIP: Remove stale bits of STRIP help text
NET: mkiss: Fix typo
tg3: Remove prev_vlan_tag from struct tx_ring_info
...
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Acked-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Acked-by: Evgeniy Polyakov <zbr@ioremap.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Acked-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Acked-by: Evgeniy Polyakov <zbr@ioremap.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Acked-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Acked-by: Evgeniy Polyakov <zbr@ioremap.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch clean up/fixes for memcg's uncharge soft limit path.
Problems:
Now, res_counter_charge()/uncharge() handles softlimit information at
charge/uncharge and softlimit-check is done when event counter per memcg
goes over limit. Now, event counter per memcg is updated only when
memory usage is over soft limit. Here, considering hierarchical memcg
management, ancesotors should be taken care of.
Now, ancerstors(hierarchy) are handled in charge() but not in uncharge().
This is not good.
Prolems:
1. memcg's event counter incremented only when softlimit hits. That's bad.
It makes event counter hard to be reused for other purpose.
2. At uncharge, only the lowest level rescounter is handled. This is bug.
Because ancesotor's event counter is not incremented, children should
take care of them.
3. res_counter_uncharge()'s 3rd argument is NULL in most case.
ops under res_counter->lock should be small. No "if" sentense is better.
Fixes:
* Removed soft_limit_xx poitner and checks in charge and uncharge.
Do-check-only-when-necessary scheme works enough well without them.
* make event-counter of memcg incremented at every charge/uncharge.
(per-cpu area will be accessed soon anyway)
* All ancestors are checked at soft-limit-check. This is necessary because
ancesotor's event counter may never be modified. Then, they should be
checked at the same time.
Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The asm-generic/gpio.h header uses the might_sleep() macro but doesn't
include the header for it, so any source code that might include
linux/gpio.h before linux/kernel.h can easily lead to a build failure.
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since 2.6.31 now has request-based device-mapper, it's useful to have
a tracepoint for request-remapping as well as bio-remapping.
This patch adds a tracepoint for request-remapping, trace_block_rq_remap().
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Alasdair G Kergon <agk@redhat.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Currently we set the bio size to the byte equivalent of the blocks to
be trimmed when submitting the initial DISCARD ioctl. That means it
is subject to the max_hw_sectors limitation of the HBA which is
much lower than the size of a DISCARD request we can support.
Add a separate max_discard_sectors tunable to limit the size for discard
requests.
We limit the max discard request size in bytes to 32bit as that is the
limit for bio->bi_size. This could be much larger if we had a way to pass
that information through the block layer.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
prepare_discard_fn() was being called in a place where memory allocation
was effectively impossible. This makes it inappropriate for all but
the most trivial translations of Linux's DISCARD operation to the block
command set. Additionally adding a payload there makes the ownership
of the bio backing unclear as it's now allocated by the device driver
and not the submitter as usual.
It is replaced with QUEUE_FLAG_DISCARD which is used to indicate whether
the queue supports discard operations or not. blkdev_issue_discard now
allocates a one-page, sector-length payload which is the right thing
for the common ATA and SCSI implementations.
The mtd implementation of prepare_discard_fn() is replaced with simply
checking for the request being a discard.
Largely based on a previous patch from Matthew Wilcox <matthew@wil.cx>
which did the prepare_discard_fn but not the different payload allocation
yet.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This provides safety against negative optlen at the type
level instead of depending upon (sometimes non-trivial)
checks against this sprinkled all over the the place, in
each and every implementation.
Based upon work done by Arjan van de Ven and feedback
from Linus Torvalds.
Signed-off-by: David S. Miller <davem@davemloft.net>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: Fix time encoding with extra epoch bits
ext4: Add a stub for mpage_da_data in the trace header
jbd2: Use tracepoints for history file
ext4: Use tracepoints for mb_history trace file
ext4, jbd2: Drop unneeded printks at mount and unmount time
ext4: Handle nested ext4_journal_start/stop calls without a journal
ext4: Make sure ext4_dirty_inode() updates the inode in no journal mode
ext4: Avoid updating the inode table bh twice in no journal mode
ext4: EXT4_IOC_MOVE_EXT: Check for different original and donor inodes first
ext4: async direct IO for holes and fallocate support
ext4: Use end_io callback to avoid direct I/O fallback to buffered I/O
ext4: Split uninitialized extents for direct I/O
ext4: release reserved quota when block reservation for delalloc retry
ext4: Adjust ext4_da_writepages() to write out larger contiguous chunks
ext4: Fix hueristic which avoids group preallocation for closed files
ext4: Use ext4_msg() for ext4_da_writepage() errors
ext4: Update documentation about quota mount options
* 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6:
PM / yenta: Fix cardbus suspend/resume regression
PM / PCMCIA: Drop second argument of pcmcia_socket_dev_suspend()
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (33 commits)
sony-laptop: re-read the rfkill state when resuming from suspend
sony-laptop: check for rfkill hard block at load time
wext: add back wireless/ dir in sysfs for cfg80211 interfaces
wext: Add bound checks for copy_from_user
mac80211: improve/fix mlme messages
cfg80211: always get BSS
iwlwifi: fix 3945 ucode info retrieval after failure
iwlwifi: fix memory leak in command queue handling
iwlwifi: fix debugfs buffer handling
cfg80211: don't set privacy w/o key
cfg80211: wext: don't display BSSID unless associated
net: Add explicit bound checks in net/socket.c
bridge: Fix double-free in br_add_if.
isdn: fix netjet/isdnhdlc build errors
atm: dereference of he_dev->rbps_virt in he_init_group()
ax25: Add missing dev_put in ax25_setsockopt
Revert "sit: stateless autoconf for isatap"
net: fix double skb free in dcbnl
net: fix nlmsg len size for skb when error bit is set.
net: fix vlan_get_size to include vlan_flags size
...
* 'drm-next' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6: (25 commits)
drm/radeon/kms: Convert R520 to new init path and associated cleanup
drm/radeon/kms: Convert RV515 to new init path and associated cleanup
drm: fix radeon DRM warnings when !CONFIG_DEBUG_FS
drm: fix drm_fb_helper warning when !CONFIG_MAGIC_SYSRQ
drm/r600: fix memory leak introduced with 64k malloc avoidance fix.
drm/kms: make fb helper work for all drivers.
drm/radeon/r600: fix offset handling in CS parser
drm/radeon/kms/r600: fix forcing pci mode on agp cards
drm/radeon/kms: fix for the extra pages copying.
drm/radeon/kms/r600: add support for vline relocs
drm/radeon/kms: fix some bugs in vline reloc
drm/radeon/kms/r600: clamp vram to aperture size
drm/kms: protect against fb helper not being created.
drm/r600: get values from the passed in IB not the copy.
drm: create gitignore file for radeon
drm/radeon/kms: remove unneeded master create/destroy functions.
drm/kms: start adding command line interface using fb.
fb: change rules for global rules match.
drm/radeon/kms: don't require up to 64k allocations. (v2)
drm/radeon/kms: enable dac load detection by default.
...
Trivial conflicts in drivers/gpu/drm/radeon/radeon_asic.h due to adding
'->vga_set_state' function pointers.
The tracepoint ext4_da_write_pages has a struct mpage_da_data*
parameter, but that struct is only defined in fs/ext4/ext4.h. This
patch adds a forward declaration for that struct, so this tracepoint
header can still be used by tools like SystemTap.
This is a continuation of the fix in commit 3661d286.
http://sourceware.org/bugzilla/show_bug.cgi?id=10703
Signed-off-by: Josh Stone <jistone@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The /proc/fs/jbd2/<dev>/history was maintained manually; by using
tracepoints, we can get all of the existing functionality of the /proc
file plus extra capabilities thanks to the ftrace infrastructure. We
save memory as a bonus.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The /proc/fs/ext4/<dev>/mb_history was maintained manually, and had a
number of problems: it required a largish amount of memory to be
allocated for each ext4 filesystem, and the s_mb_history_lock
introduced a CPU contention problem.
By ripping out the mb_history code and replacing it with ftrace
tracepoints, and we get more functionality: timestamps, event
filtering, the ability to correlate mballoc history with other ext4
tracepoints, etc.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
pcmcia_socket_dev_suspend() doesn't use its second argument, so it
may be dropped safely.
This change is necessary for the subsequent yenta suspend/resume fix.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: stable@kernel.org
The move away from having drivers assign wireless handlers,
in favour of making cfg80211 assign them, broke the sysfs
registration (the wireless/ dir went missing) because the
handlers are now assigned only after registration, which is
too late.
Fix this by special-casing cfg80211-based devices, all
of which are required to have an ieee80211_ptr, in the
sysfs code, and also using get_wireless_stats() to have
the same values reported as in procfs.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Reported-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Tested-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Work around problems in the writeback code to force out writebacks in
larger chunks than just 4mb, which is just too small. This also works
around limitations in the ext4 block allocator, which can't allocate
more than 2048 blocks at a time. So we need to defeat the round-robin
characteristics of the writeback code and try to write out as many
blocks in one inode before allowing the writeback code to move on to
another inode. We add a a new per-filesystem tunable,
max_writeback_mb_bump, which caps this to a default of 128mb per
inode.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This initialises the fb helper with the connector helper,
so that the fb cmdline code works for intel as well.
Signed-off-by: Dave Airlie <airlied@redhat.com>
The following commit made console open fails while booting:
commit b50989dc44
Author: Alan Cox <alan@linux.intel.com>
Date: Sat Sep 19 13:13:22 2009 -0700
tty: make the kref destructor occur asynchronously
Due to tty release routines run in a workqueue now, error like the
following will be reported while booting:
INIT open /dev/console Input/output error
It also causes hibernation regression to appear as reported at
http://bugzilla.kernel.org/show_bug.cgi?id=14229
The reason is that now there's latency issue with closing, but when
we open a "closing not finished" tty, -EIO will be returned.
Fix it as per the following Alan's suggestion:
Fun but it's actually not a bug and the fix is wrong in itself as
the port may be closing but not yet being destructed, in which case
it seems to do the wrong thing. Opening a tty that is closing (and
could be closing for long periods) is supposed to return -EIO.
I suspect a better way to deal with this and keep the old console
timing is to split tty->shutdown into two functions.
tty->shutdown() - called synchronously just before we dump the tty
onto the waitqueue for destruction
tty->cleanup() - called when the destructor runs.
We would then do the shutdown part which can occur in IRQ context
fine, before queueing the rest of the release (from tty->magic = 0
... the end) to occur asynchronously
The USB update in -next would then need a call like
if (tty->cleanup)
tty->cleanup(tty);
at the top of the async function and the USB shutdown to be split
between shutdown and cleanup as the USB resource cleanup and final
tidy cannot occur synchronously as it needs to sleep.
In other words the logic becomes
final kref put
make object unfindable
async
clean it up
Signed-off-by: Dave Young <hidave.darkstar@gmail.com>
[ rjw: Rebased on top of 2.6.31-git, reworked the changelog. ]
Signed-off-by: "Rafael J. Wysocki" <rjw@sisk.pl>
[ Changed serial naming to match new rules, dropped tty_shutdown as per
comments from Alan Stern - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mark struct vm_area_struct::vm_ops as const
* mark vm_ops in AGP code
But leave TTM code alone, something is fishy there with global vm_ops
being used.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit 645069299a.
While the code does not actually break anything, it does not completely follow
RFC5214 yet. After talking back with Fred L. Templin, I agree that completing the
ISATAP specific RS/RA code, would pollute the kernel a lot with code that is better
implemented in userspace.
The kernel should not send RS packages for ISATAP at all.
Signed-off-by: Sascha Hlusiak <contact@saschahlusiak.de>
Acked-by: Fred L. Templin <Fred.L.Templin@boeing.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* 'for-linus' of git://git.o-hand.com/linux-rpurdie-backlight:
backlight: new driver for ADP5520/ADP5501 MFD PMICs
backlight: extend event support to also support poll()
backlight/eeepc-laptop: Update the backlight state when we change brightness
backlight/acpi: Update the backlight state when we change brightness
backlight: Allow drivers to update the core, and generate events on changes
backlight: switch to da903x driver to dev_pm_ops
backlight: Add support for the Avionic Design Xanthos backlight device.
backlight: spi driver for LMS283GF05 LCD
backlight: move hp680-bl's probe function to .devinit.text
backlight: Add support for new Apple machines.
backlight: mbp_nvidia_bl: add support for MacBookAir 1,1
backlight: Add WM831x backlight driver
Trivial conflicts due to '#ifdef CONFIG_PM' differences in
drivers/video/backlight/da903x_bl.c
* remove asm/atomic.h inclusion from kref.h -- not needed, linux/types.h
is enough for atomic_t
* remove linux/kref.h inclusion from files which do not need it.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
modules, tracing: Remove stale struct marker signature from module_layout()
tracing/workqueue: Use %pf in workqueue trace events
tracing: Fix a comment and a trivial format issue in tracepoint.h
tracing: Fix failure path in ftrace_regex_open()
tracing: Fix failure path in ftrace_graph_write()
tracing: Check the return value of trace_get_user()
tracing: Fix off-by-one in trace_get_user()
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6: (32 commits)
ACPI: i2c-scmi: don't use acpi_device_uid()
ACPI: simplify building device HID/CID list
ACPI: remove acpi_device_uid() and related stuff
ACPI: remove acpi_device.flags.hardware_id
ACPI: remove acpi_device.flags.compatible_ids
ACPI: maintain a single list of _HID and _CID IDs
ACPI: make sure every acpi_device has an ID
ACPI: use acpi_device_hid() when possible
ACPI: fix synthetic HID for \_SB_
ACPI: handle re-enumeration, when acpi_devices might already exist
ACPI: factor out device type and status checking
ACPI: add acpi_bus_get_status_handle()
ACPI: use acpi_walk_namespace() to enumerate devices
ACPI: identify device tree root by null parent pointer, not ACPI_BUS_TYPE
ACPI: enumerate namespace before adding functional fixed hardware devices
ACPI: convert acpi_bus_scan() to operate on an acpi_handle
ACPI: add acpi_bus_get_parent() and remove "parent" arguments
ACPI: remove unnecessary argument checking
ACPI: remove redundant "type" arguments
ACPI: remove acpi_device_set_context() "type" argument
...
Sometimes we only want to write pages from a specific super_block,
so allow that to be passed in.
This fixes a problem with commit 56a131dcf7
causing writeback on all super_blocks on a bdi, where we only really
want to sync a specific sb from writeback_inodes_sb().
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Nobody uses acpi_device_uid(), so this patch removes it.
Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Every acpi_device has at least one ID (if there's no _HID or _CID, we
give it a synthetic or default ID). So there's no longer a need to
check whether an ID exists; we can just use it.
Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Len Brown <len.brown@intel.com>
We now keep a single list of IDs that includes both the _HID and any
_CIDs. We no longer need to keep track of whether the device has a _CID.
Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Len Brown <len.brown@intel.com>
There's no need to treat _HID and _CID differently. Keeping them in
a single list makes code that uses the IDs a little simpler because it
can just traverse the list rather than checking "do we have a HID?",
"do we have any CIDs?"
Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Reviewed-by: Alex Chiang <achiang@hp.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Add acpi_bus_get_status_handle() so we can get the status of a namespace
object before building a struct acpi_device.
This removes a use of "device->flags.dynamic_status", a cached indicator of
whether _STA exists. It seems simpler and more reliable to just evaluate
_STA and catch AE_NOT_FOUND errors.
Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Len Brown <len.brown@intel.com>
We can identify the root of the ACPI device tree by the fact that it
has no parent. This is simpler than passing around ACPI_BUS_TYPE_SYSTEM
and will help remove special treatment of the device tree root.
Currently, we add the root by hand with ACPI_BUS_TYPE_SYSTEM. If we
traverse the tree treating the root as just another device and use
acpi_get_type(), the root shows up as ACPI_TYPE_DEVICE.
Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Most uses of the ACPI bus device_type (ACPI_BUS_TYPE_DEVICE,
ACPI_BUS_TYPE_POWER, etc) are during device initialization, but
we do need it later for notify handler installation, since that
is different for fixed hardware devices vs. namespace devices.
This patch saves the device_type in the acpi_device structure,
so we can check that rather than comparing against the _HID string.
Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Len Brown <len.brown@intel.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (94 commits)
genetlink: fix netns vs. netlink table locking (2)
3c59x: Get rid of "Trying to free already-free IRQ"
tunnel: eliminate recursion field
ems_pci: fix size of CAN controllers BAR mapping for CPC-PCI v2
net: fix htmldocs sunrpc, clnt.c
Phonet: error on broadcast sending (unimplemented)
Phonet: fix race for port number in concurrent bind()
pktgen: better scheduler friendliness
pktgen: T_TERMINATE flag is unused
ipv4: check optlen for IP_MULTICAST_IF option
ath9k: Initialize txgain and rxgain for newer AR9287 chipsets.
iwlagn: fix panic in iwl{5000,4965}_rx_reply_tx
ath9k: Fix RFKILL bugs
drivers/net/wireless: Use usb_endpoint_dir_out
cfg80211: don't overwrite privacy setting
wl12xx: fix kconfig/link errors
rt2x00: fix the definition of rt2x00crypto_rx_insert_iv
iwlwifi: reduce noise when skb allocation fails
iwlwifi: do not send sync command while holding spinlock
mac80211: fix DTIM setting
...
[note this requires an fb patch posted to linux-fbdev-devel already]
This uses the normal video= command line option to control the kms
output setup at boot time. It is used to override the autodetection
done by kms.
video= normally takes a framebuffer as the first parameter, in kms
it will take a connector name, DVI-I-1, or LVDS-1 etc. If no output
connector is specified the mode string will apply to all connectors.
The mode specification used will match down the probed modes, and if
no mode is found it will add a CVT mode that matches.
video=1024x768 - all connectors match a 1024x768 mode or add a CVT on
video=VGA-1:1024x768, VGA-1 connector gets mode only.
The same strings as used in current fb modedb.c are used, except I've
added three more letters, e, D, d, e = enable, D = enable Digital,
d = disable, which allow a connector to be forced into a certain state.
Signed-off-by: Dave Airlie <airlied@redhat.com>
The NOMMU fallback for is_vmalloc_or_module_addr() should be static inline,
not just static, in linux/mm.h.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The old RW_DATA_SECTION had INIT_TASK_DATA (which was
more-than-PAGE_SIZE-aligned), followed by a bunch of small alignment
stuff, followed by more PAGE_SIZE-aligned stuff, so you wasted memory
in the middle of .data re-aligning back up to PAGE_SIZE.
This patch sorts the sections by alignment requirements, which should
pack them essentially optimally.
Signed-off-by: Tim Abbott <tabbott@ksplice.com>
Reviewed-by: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Why macros are always wrong:
mm/mmap.c: In function 'do_mmap_pgoff':
mm/mmap.c:953: warning: unused variable 'user'
also, move a couple of struct forward-decls outside `#ifdef
CONFIG_HUGETLB_PAGE' - it's pointless and frequently harmful to make these
conditional (eg, this patch needed `struct user_struct').
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: Eric Whitney <eric.whitney@hp.com>
Cc: Eric B Munson <ebmunson@us.ibm.com>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Similar to commit d136f1bd36,
there's a bug when unregistering a generic netlink family,
which is caught by the might_sleep() added in that commit:
BUG: sleeping function called from invalid context at net/netlink/af_netlink.c:183
in_atomic(): 1, irqs_disabled(): 0, pid: 1510, name: rmmod
2 locks held by rmmod/1510:
#0: (genl_mutex){+.+.+.}, at: [<ffffffff8138283b>] genl_unregister_family+0x2b/0x130
#1: (rcu_read_lock){.+.+..}, at: [<ffffffff8138270c>] __genl_unregister_mc_group+0x1c/0x120
Pid: 1510, comm: rmmod Not tainted 2.6.31-wl #444
Call Trace:
[<ffffffff81044ff9>] __might_sleep+0x119/0x150
[<ffffffff81380501>] netlink_table_grab+0x21/0x100
[<ffffffff813813a3>] netlink_clear_multicast_users+0x23/0x60
[<ffffffff81382761>] __genl_unregister_mc_group+0x71/0x120
[<ffffffff81382866>] genl_unregister_family+0x56/0x130
[<ffffffffa0007d85>] nl80211_exit+0x15/0x20 [cfg80211]
[<ffffffffa000005a>] cfg80211_exit+0x1a/0x40 [cfg80211]
Fix in the same way by grabbing the netlink table lock
before doing rcu_read_lock().
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
It seems recursion field from "struct ip_tunnel" is not anymore needed.
recursion prevention is done at the upper level (in dev_queue_xmit()),
since we use HARD_TX_LOCK protection for tunnels.
This avoids a cache line ping pong on "struct ip_tunnel" : This structure
should be now mostly read on xmit and receive paths.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If we ever implement this, then we can stop returning an error.
Signed-off-by: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* 'drm-intel-next' of git://git.kernel.org/pub/scm/linux/kernel/git/anholt/drm-intel: (57 commits)
drm/i915: Handle ERESTARTSYS during page fault
drm/i915: Warn before mmaping a purgeable buffer.
drm/i915: Track purged state.
drm/i915: Remove eviction debug spam
drm/i915: Immediately discard any backing storage for uneeded objects
drm/i915: Do not mis-classify clean objects as purgeable
drm/i915: Whitespace correction for madv
drm/i915: BUG_ON page refleak during unbind
drm/i915: Search harder for a reusable object
drm/i915: Clean up evict from list.
drm/i915: Add tracepoints
drm/i915: framebuffer compression for GM45+
drm/i915: split display functions by chip type
drm/i915: Skip the sanity checks if the current relocation is valid
drm/i915: Check that the relocation points to within the target
drm/i915: correct FBC update when pipe base update occurs
drm/i915: blacklist Acer AspireOne lid status
ACPI: make ACPI button funcs no-ops if not built in
drm/i915: prevent FIFO calculation overflows on 32 bits with high dotclocks
drm/i915: intel_display.c handle latency variable efficiently
...
Fix up trivial conflicts in drivers/gpu/drm/i915/{i915_dma.c|i915_drv.h}
* git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus:
module: don't call percpu_modfree on NULL pointer.
module: fix memory leak when load fails after srcversion/version allocated
module: preferred way to use MODULE_AUTHOR
param: allow whitespace as kernel parameter separator
module: reduce string table for loaded modules (v2)
module: reduce symbol table for loaded modules (v2)
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
truncate: use new helpers
truncate: new helpers
fs: fix overflow in sys_mount() for in-kernel calls
fs: Make unload_nls() NULL pointer safe
freeze_bdev: grab active reference to frozen superblocks
freeze_bdev: kill bd_mount_sem
exofs: remove BKL from super operations
fs/romfs: correct error-handling code
vfs: seq_file: add helpers for data filling
vfs: remove redundant position check in do_sendfile
vfs: change sb->s_maxbytes to a loff_t
vfs: explicitly cast s_maxbytes in fiemap_check_ranges
libfs: return error code on failed attr set
seq_file: return a negative error code when seq_path_root() fails.
vfs: optimize touch_time() too
vfs: optimization for touch_atime()
vfs: split generic_forget_inode() so that hugetlbfs does not have to copy it
fs/inode.c: add dev-id and inode number for debugging in init_special_inode()
libfs: make simple_read_from_buffer conventional
For the longest time now we've been using multiple MODULE_AUTHOR()
statements when a module has more than one author, but the comment here
disagrees.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Luciano Coelho <luciano.coelho@nokia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Also remove all parts of the string table (referenced by the symbol
table) that are not needed for kallsyms use (i.e. which were only
referenced by symbols discarded by the previous patch, or not
referenced at all for whatever reason).
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Discard all symbols not interesting for kallsyms use: absolute,
section, and in the common case (!KALLSYMS_ALL) data ones.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
* 'for-linus' of git://neil.brown.name/md: (97 commits)
md: raid-1/10: fix RW bits manipulation
md: remove unnecessary memset from multipath.
md: report device as congested when suspended
md: Improve name of threads created by md_register_thread
md: remove sparse warnings about lock context.
md: remove sparse waring "symbol xxx shadows an earlier one"
async_tx/raid6: add missing dma_unmap calls to the async fail case
ioat3: fix uninitialized var warnings
drivers/dma/ioat/dma_v2.c: fix warnings
raid6test: fix stack overflow
ioat2: clarify ring size limits
md/raid6: cleanup ops_run_compute6_2
md/raid6: eliminate BUG_ON with side effect
dca: module load should not be an error message
ioat: driver version 4.0
dca: registering requesters in multiple dca domains
async_tx: remove HIGHMEM64G restriction
dmaengine: sh: Add Support SuperH DMA Engine driver
dmaengine: Move all map_sg/unmap_sg for slave channel to its client
fsldma: Add DMA_SLAVE support
...
* 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (21 commits)
HWPOISON: Enable error_remove_page on btrfs
HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs
HWPOISON: Add madvise() based injector for hardware poisoned pages v4
HWPOISON: Enable error_remove_page for NFS
HWPOISON: Enable .remove_error_page for migration aware file systems
HWPOISON: The high level memory error handler in the VM v7
HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process
HWPOISON: shmem: call set_page_dirty() with locked page
HWPOISON: Define a new error_remove_page address space op for async truncation
HWPOISON: Add invalidate_inode_page
HWPOISON: Refactor truncate to allow direct truncating of page v2
HWPOISON: check and isolate corrupted free pages v2
HWPOISON: Handle hardware poisoned pages in try_to_unmap
HWPOISON: Use bitmask/action code for try_to_unmap behaviour
HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
HWPOISON: Add poison check to page fault handling
HWPOISON: Add basic support for poisoned pages in fault handler v3
HWPOISON: Add new SIGBUS error codes for hardware poison signals
HWPOISON: Add support for poison swap entries v2
HWPOISON: Export some rmap vma locking to outside world
...
Because the binfmt is not different between threads in the same process,
it can be moved from task_struct to mm_struct. And binfmt moudle is
handled per mm_struct instead of task_struct.
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When unaligned accesses are required for uncompressing a kernel (such as
for LZO decompression on ARM in a patch that follows), including
<linux/kernel.h> causes issues as it brings in a lot of things that are
not available in the decompression environment.
linux/kernel.h brings at least:
extern int console_printk[];
extern const char hex_asc[];
which causes errors at link-time as they are not available when
compiling the pre-boot environement. There are also a few others:
arch/arm/boot/compressed/misc.o: In function `valid_user_regs':
arch/arm/include/asm/ptrace.h:158: undefined reference to `elf_hwcap'
arch/arm/boot/compressed/misc.o: In function `console_silent':
include/linux/kernel.h:292: undefined reference to `console_printk'
arch/arm/boot/compressed/misc.o: In function `console_verbose':
include/linux/kernel.h:297: undefined reference to `console_printk'
arch/arm/boot/compressed/misc.o: In function `pack_hex_byte':
include/linux/kernel.h:360: undefined reference to `hex_asc'
arch/arm/boot/compressed/misc.o: In function `hweight_long':
include/linux/bitops.h:45: undefined reference to `hweight32'
arch/arm/boot/compressed/misc.o: In function `__cmpxchg_local_generic':
include/asm-generic/cmpxchg-local.h:21: undefined reference to `wrong_size_cmpxchg'
include/asm-generic/cmpxchg-local.h:42: undefined reference to `wrong_size_cmpxchg'
arch/arm/boot/compressed/misc.o: In function `__xchg':
arch/arm/include/asm/system.h:309: undefined reference to `__bad_xchg'
However, those files apparently use nothing from <linux/kernel.h>, all
they need is the declaration of types such as u32 or u64, so
<linux/types.h> should be enough
Signed-off-by: Albin Tonnerre <albin.tonnerre@free-electrons.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Phillip Lougher <phillip@lougher.demon.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
->ioctx_lock and ->ioctx_list are used only under CONFIG_AIO.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Zach Brown <zach.brown@oracle.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The forward decls for some kernel types are only needed by the code behind
__KERNEL__, so don't bleed these types to userspace.
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It's unused.
It isn't needed -- read or write flag is already passed and sysctl
shouldn't care about the rest.
It _was_ used in two places at arch/frv for some reason.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: David Howells <dhowells@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__fatal_signal_pending inlines to one instruction on x86, probably two
instructions on other machines. It takes two longer x86 instructions just
to call it and test its return value, not to mention the function itself.
On my random x86_64 config, this saved 70 bytes of text (59 of those being
__fatal_signal_pending itself).
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In order to direct the SIGIO signal to a particular thread of a
multi-threaded application we cannot, like suggested by the manpage, put a
TID into the regular fcntl(F_SETOWN) call. It will still be send to the
whole process of which that thread is part.
Since people do want to properly direct SIGIO we introduce F_SETOWN_EX.
The need to direct SIGIO comes from self-monitoring profiling such as with
perf-counters. Perf-counters uses SIGIO to notify that new sample data is
available. If the signal is delivered to the same task that generated the
new sample it can augment that data by inspecting the task's user-space
state right after it returns from the kernel. This is esp. convenient
for interpreted or virtual machine driven environments.
Both F_SETOWN_EX and F_GETOWN_EX take a pointer to a struct f_owner_ex
as argument:
struct f_owner_ex {
int type;
pid_t pid;
};
Where type is one of F_OWNER_TID, F_OWNER_PID or F_OWNER_GID.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Tested-by: stephane eranian <eranian@googlemail.com>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce do_send_sig_info() and convert group_send_sig_info(),
send_sig_info(), do_send_specific() to use this helper.
Hopefully it will have more users soon, it allows to specify
specific/group behaviour via "bool group" argument.
Shaves 80 bytes from .text.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: stephane eranian <eranian@googlemail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sys_delete_module() can set MODULE_STATE_GOING after
search_binary_handler() does try_module_get(). In this case
set_binfmt()->try_module_get() fails but since none of the callers
check the returned error, the task will run with the wrong old
->binfmt.
The proper fix should change all ->load_binary() methods, but we can
rely on fact that the caller must hold a reference to binfmt->module
and use __module_get() which never fails.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This changes tracehook_notify_jctl() so it's called with the siglock held,
and changes its argument and return value definition. These clean-ups
make it a better fit for what new tracing hooks need to check.
Tracing needs the siglock here, held from the time TASK_STOPPED was set,
to avoid potential SIGCONT races if it wants to allow any blocking in its
tracing hooks.
This also folds the finish_stop() function into its caller
do_signal_stop(). The function is short, called only once and only
unconditionally. It aids readability to fold it in.
[oleg@redhat.com: do not call tracehook_notify_jctl() in TASK_STOPPED state]
[oleg@redhat.com: introduce tracehook_finish_jctl() helper]
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The bug is old, it wasn't cause by recent changes.
Test case:
static void *tfunc(void *arg)
{
int pid = (long)arg;
assert(ptrace(PTRACE_ATTACH, pid, NULL, NULL) == 0);
kill(pid, SIGKILL);
sleep(1);
return NULL;
}
int main(void)
{
pthread_t th;
long pid = fork();
if (!pid)
pause();
signal(SIGCHLD, SIG_IGN);
assert(pthread_create(&th, NULL, tfunc, (void*)pid) == 0);
int r = waitpid(-1, NULL, __WNOTHREAD);
printf("waitpid: %d %m\n", r);
return 0;
}
Before the patch this program hangs, after this patch waitpid() correctly
fails with errno == -ECHILD.
The problem is, __ptrace_detach() reaps the EXIT_ZOMBIE tracee if its
->real_parent is our sub-thread and we ignore SIGCHLD. But in this case
we should wake up other threads which can sleep in do_wait().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Vitaly Mayatskikh <vmayatsk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Implement reclaim from groups over their soft limit
Permit reclaim from memory cgroups on contention (via the direct reclaim
path).
memory cgroup soft limit reclaim finds the group that exceeds its soft
limit by the largest number of pages and reclaims pages from it and then
reinserts the cgroup into its correct place in the rbtree.
Add additional checks to mem_cgroup_hierarchical_reclaim() to detect long
loops in case all swap is turned off. The code has been refactored and
the loop check (loop < 2) has been enhanced for soft limits. For soft
limits, we try to do more targetted reclaim. Instead of bailing out after
two loops, the routine now reclaims memory proportional to the size by
which the soft limit is exceeded. The proportion has been empirically
determined.
[akpm@linux-foundation.org: build fix]
[kamezawa.hiroyu@jp.fujitsu.com: fix softlimit css refcnt handling]
[nishimura@mxp.nes.nec.co.jp: refcount of the "victim" should be decremented before exiting the loop]
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Organize cgroups over soft limit in a RB-Tree
Introduce an RB-Tree for storing memory cgroups that are over their soft
limit. The overall goal is to
1. Add a memory cgroup to the RB-Tree when the soft limit is exceeded.
We are careful about updates, updates take place only after a particular
time interval has passed
2. We remove the node from the RB-Tree when the usage goes below the soft
limit
The next set of patches will exploit the RB-Tree to get the group that is
over its soft limit by the largest amount and reclaim from it, when we
face memory contention.
[hugh.dickins@tiscali.co.uk: CONFIG_CGROUP_MEM_RES_CTLR=y CONFIG_PREEMPT=y fails to boot]
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add an interface to allow get/set of soft limits. Soft limits for memory
plus swap controller (memsw) is currently not supported. Resource
counters have been enhanced to support soft limits and new type
RES_SOFT_LIMIT has been added. Unlike hard limits, soft limits can be
directly set and do not need any reclaim or checks before setting them to
a newer value.
Kamezawa-San raised a question as to whether soft limit should belong to
res_counter. Since all resources understand the basic concepts of hard
and soft limits, it is justified to add soft limits here. Soft limits are
a generic resource usage feature, even file system quotas support soft
limits.
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change the memory cgroup to remove the overhead associated with accounting
all pages in the root cgroup. As a side-effect, we can no longer set a
memory hard limit in the root cgroup.
A new flag to track whether the page has been accounted or not has been
added as well. Flags are now set atomically for page_cgroup,
pcg_default_flags is now obsolete and removed.
[akpm@linux-foundation.org: fix a few documentation glitches]
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alter the ss->can_attach and ss->attach functions to be able to deal with
a whole threadgroup at a time, for use in cgroup_attach_proc. (This is a
pre-patch to cgroup-procs-writable.patch.)
Currently, new mode of the attach function can only tell the subsystem
about the old cgroup of the threadgroup leader. No subsystem currently
needs that information for each thread that's being moved, but if one were
to be added (for example, one that counts tasks within a group) this bit
would need to be reworked a bit to tell the subsystem the right
information.
[hidave.darkstar@gmail.com: fix build]
Signed-off-by: Ben Blum <bblum@google.com>
Signed-off-by: Paul Menage <menage@google.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Reviewed-by: Matt Helsley <matthltc@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Dave Young <hidave.darkstar@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Changes css_set freeing mechanism to be under RCU
This is a prepatch for making the procs file writable. In order to free the
old css_sets for each task to be moved as they're being moved, the freeing
mechanism must be RCU-protected, or else we would have to have a call to
synchronize_rcu() for each task before freeing its old css_set.
Signed-off-by: Ben Blum <bblum@google.com>
Signed-off-by: Paul Menage <menage@google.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Previously there was the problem in which two processes from different pid
namespaces reading the tasks or procs file could result in one process
seeing results from the other's namespace. Rather than one pidlist for
each file in a cgroup, we now keep a list of pidlists keyed by namespace
and file type (tasks versus procs) in which entries are placed on demand.
Each pidlist has its own lock, and that the pidlists themselves are passed
around in the seq_file's private pointer means we don't have to touch the
cgroup or its master list except when creating and destroying entries.
Signed-off-by: Ben Blum <bblum@google.com>
Signed-off-by: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
struct cgroup used to have a bunch of fields for keeping track of the
pidlist for the tasks file. Those are now separated into a new struct
cgroup_pidlist, of which two are had, one for procs and one for tasks.
The way the seq_file operations are set up is changed so that just the
pidlist struct gets passed around as the private data.
Interface example: Suppose a multithreaded process has pid 1000 and other
threads with ids 1001, 1002, 1003:
$ cat tasks
1000
1001
1002
1003
$ cat cgroup.procs
1000
$
Signed-off-by: Ben Blum <bblum@google.com>
Signed-off-by: Paul Menage <menage@google.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The following series adds a "cgroup.procs" file to each cgroup that
reports unique tgids rather than pids, and allows all threads in a
threadgroup to be atomically moved to a new cgroup.
The subsystem "attach" interface is modified to support attaching whole
threadgroups at a time, which could introduce potential problems if any
subsystem were to need to access the old cgroup of every thread being
moved. The attach interface may need to be revised if this becomes the
case.
Also added is functionality for read/write locking all CLONE_THREAD
fork()ing within a threadgroup, by means of an rwsem that lives in the
sighand_struct, for per-threadgroup-ness and also for sharing a cacheline
with the sighand's atomic count. This scheme should introduce no extra
overhead in the fork path when there's no contention.
The final patch reveals potential for a race when forking before a
subsystem's attach function is called - one potential solution in case any
subsystem has this problem is to hang on to the group's fork mutex through
the attach() calls, though no subsystem yet demonstrates need for an
extended critical section.
This patch:
Revert
commit 096b7fe012
Author: Li Zefan <lizf@cn.fujitsu.com>
AuthorDate: Wed Jul 29 15:04:04 2009 -0700
Commit: Linus Torvalds <torvalds@linux-foundation.org>
CommitDate: Wed Jul 29 19:10:35 2009 -0700
cgroups: fix pid namespace bug
This is in preparation for some clashing cgroups changes that subsume the
original commit's functionaliy.
The original commit fixed a pid namespace bug which Ben Blum fixed
independently (in the same way, but with different code) as part of a
series of patches. I played around with trying to reconcile Ben's patch
series with Li's patch, but concluded that it was simpler to just revert
Li's, given that Ben's patch series contained essentially the same fix.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are many similar code in kernel for one object: convert time between
calendar time and broken-down time.
Here is some source I found:
fs/ncpfs/dir.c
fs/smbfs/proc.c
fs/fat/misc.c
fs/udf/udftime.c
fs/cifs/netmisc.c
net/netfilter/xt_time.c
drivers/scsi/ips.c
drivers/input/misc/hp_sdc_rtc.c
drivers/rtc/rtc-lib.c
arch/ia64/hp/sim/boot/fw-emu.c
arch/m68k/mac/misc.c
arch/powerpc/kernel/time.c
arch/parisc/include/asm/rtc.h
...
We can make a common function for this type of conversion, At least we
can get following benefit:
1: Make kernel simple and unify
2: Easy to fix bug in converting code
3: Reduce clone of code in future
For example, I'm trying to make ftrace display walltime,
this patch will make me easy.
This code is based on code from glibc-2.6
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce new truncate helpers truncate_pagecache and inode_newsize_ok.
vmtruncate is also consolidated from mm/memory.c and mm/nommu.c and
into mm/truncate.c.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Currently we held s_umount while a filesystem is frozen, despite that we
might return to userspace and unlock it from a different process. Instead
grab an active reference to keep the file system busy and add an explicit
check for frozen filesystems in remount and reject the remount instead
of blocking on s_umount.
Add a new get_active_super helper to super.c for use by freeze_bdev that
grabs an active reference to a superblock from a given block device.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Now that we have the freeze count there is not much reason for bd_mount_sem
anymore. The actual freeze/thaw operations are serialized using the
bd_fsfreeze_mutex, and the only other place we take bd_mount_sem is
get_sb_bdev which tries to prevent mounting a filesystem while the block
device is frozen. Instead of add a check for bd_fsfreeze_count and
return -EBUSY if a filesystem is frozen. While that is a change in user
visible behaviour a failing mount is much better for this case rather
than having the mount process stuck uninterruptible for a long time.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add two helpers that allow access to the seq_file's own buffer, but
hide the internal details of seq_files.
This allows easier implementation of special purpose filling
functions. It also cleans up some existing functions which duplicated
the seq_file logic.
Make these inline functions in seq_file.h, as suggested by Al.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
sb->s_maxbytes is supposed to indicate the maximum size of a file that can
exist on the filesystem. It's declared as an unsigned long long.
Even if a filesystem has no inherent limit that prevents it from using
every bit in that unsigned long long, it's still problematic to set it to
anything larger than MAX_LFS_FILESIZE. There are places in the kernel
that cast s_maxbytes to a signed value. If it's set too large then this
cast makes it a negative number and generally breaks the comparison.
Change s_maxbytes to be loff_t instead. That should help eliminate the
temptation to set it too large by making it a signed value.
Also, add a warning for couple of releases to help catch filesystems that
set s_maxbytes too large. Eventually we can either convert this to a
BUG() or just remove it and in the hope that no one will get it wrong now
that it's a signed value.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Robert Love <rlove@google.com>
Cc: Mandeep Singh Baines <msb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>