Commit graph

24256 commits

Author SHA1 Message Date
Serge E. Hallyn
59607db367 userns: add a user_namespace as creator/owner of uts_namespace
The expected course of development for user namespaces targeted
capabilities is laid out at https://wiki.ubuntu.com/UserNamespace.

Goals:

- Make it safe for an unprivileged user to unshare namespaces.  They
  will be privileged with respect to the new namespace, but this should
  only include resources which the unprivileged user already owns.

- Provide separate limits and accounting for userids in different
  namespaces.

Status:

  Currently (as of 2.6.38) you can clone with the CLONE_NEWUSER flag to
  get a new user namespace if you have the CAP_SYS_ADMIN, CAP_SETUID, and
  CAP_SETGID capabilities.  What this gets you is a whole new set of
  userids, meaning that user 500 will have a different 'struct user' in
  your namespace than in other namespaces.  So any accounting information
  stored in struct user will be unique to your namespace.

  However, throughout the kernel there are checks which

  - simply check for a capability.  Since root in a child namespace
    has all capabilities, this means that a child namespace is not
    constrained.

  - simply compare uid1 == uid2.  Since these are the integer uids,
    uid 500 in namespace 1 will be said to be equal to uid 500 in
    namespace 2.

  As a result, the lxc implementation at lxc.sf.net does not use user
  namespaces.  This is actually helpful because it leaves us free to
  develop user namespaces in such a way that, for some time, user
  namespaces may be unuseful.

Bugs aside, this patchset is supposed to not at all affect systems which
are not actively using user namespaces, and only restrict what tasks in
child user namespace can do.  They begin to limit privilege to a user
namespace, so that root in a container cannot kill or ptrace tasks in the
parent user namespace, and can only get world access rights to files.
Since all files currently belong to the initila user namespace, that means
that child user namespaces can only get world access rights to *all*
files.  While this temporarily makes user namespaces bad for system
containers, it starts to get useful for some sandboxing.

I've run the 'runltplite.sh' with and without this patchset and found no
difference.

This patch:

copy_process() handles CLONE_NEWUSER before the rest of the namespaces.
So in the case of clone(CLONE_NEWUSER|CLONE_NEWUTS) the new uts namespace
will have the new user namespace as its owner.  That is what we want,
since we want root in that new userns to be able to have privilege over
it.

Changelog:
	Feb 15: don't set uts_ns->user_ns if we didn't create
		a new uts_ns.
	Feb 23: Move extern init_user_ns declaration from
		init/version.c to utsname.h.

Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Daniel Lezcano <daniel.lezcano@free.fr>
Acked-by: David Howells <dhowells@redhat.com>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23 19:46:59 -07:00
Eric W. Biederman
45a68628d3 pid: remove the child_reaper special case in init/main.c
This patchset is a cleanup and a preparation to unshare the pid namespace.
These prerequisites prepare for Eric's patchset to give a file descriptor
to a namespace and join an existing namespace.

This patch:

It turns out that the existing assignment in copy_process of the
child_reaper can handle the initial assignment of child_reaper we just
need to generalize the test in kernel/fork.c

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Serge E. Hallyn <serge@hallyn.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23 19:46:57 -07:00
Alexandre Bounine
569fccb6b4 rapidio: modify mport ID assignment
Changes mport ID and host destination ID assignment to implement unified
method common to all mport drivers.  Makes "riohdid=" kernel command line
parameter common for all architectures with support for more that one host
destination ID assignment.

Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: Matt Porter <mporter@kernel.crashing.org>
Cc: Li Yang <leoli@freescale.com>
Cc: Thomas Moll <thomas.moll@sysgo.com>
Cc: Micha Nelissen <micha@neli.hopto.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23 19:46:43 -07:00
Alexandre Bounine
2f809985d2 rapidio: modify subsystem and driver initialization sequence
Subsystem initialization sequence modified to support presence of multiple
RapidIO controllers in the system.  The new sequence is compatible with
initialization of PCI devices.

Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: Matt Porter <mporter@kernel.crashing.org>
Cc: Li Yang <leoli@freescale.com>
Cc: Thomas Moll <thomas.moll@sysgo.com>
Cc: Micha Nelissen <micha@neli.hopto.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23 19:46:42 -07:00
Alexandre Bounine
f8f0626989 rapidio: add architecture specific callbacks
This set of patches eliminates RapidIO dependency on PowerPC architecture
and makes it available to other architectures (x86 and MIPS).  It also
enables support of new platform independent RapidIO controllers such as
PCI-to-SRIO and PCI Express-to-SRIO.

This patch:

Extend number of mport callback functions to eliminate direct linking of
architecture specific mport operations.

Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: Matt Porter <mporter@kernel.crashing.org>
Cc: Li Yang <leoli@freescale.com>
Cc: Thomas Moll <thomas.moll@sysgo.com>
Cc: Micha Nelissen <micha@neli.hopto.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23 19:46:41 -07:00
Alexey Dobriyan
312ec7e50c proc: make struct proc_dir_entry::namelen unsigned int
1. namelen is declared "unsigned short" which hints for "maybe space savings".
   Indeed in 2.4 struct proc_dir_entry looked like:

        struct proc_dir_entry {
                unsigned short low_ino;
                unsigned short namelen;

   Now, low_ino is "unsigned int", all savings were gone for a long time.
   "struct proc_dir_entry" is not that countless to worry about it's size,
   anyway.

2. converting from unsigned short to int/unsigned int can only create
   problems, we better play it safe.

Space is not really conserved, because of natural alignment for the next
field.  sizeof(struct proc_dir_entry) remains the same.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23 19:46:37 -07:00
Johannes Weiner
7ffd4ca7a2 memcg: convert uncharge batching from bytes to page granularity
We never uncharge subpage quantities.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23 19:46:30 -07:00
Johannes Weiner
6b3ae58efc memcg: remove direct page_cgroup-to-page pointer
In struct page_cgroup, we have a full word for flags but only a few are
reserved.  Use the remaining upper bits to encode, depending on
configuration, the node or the section, to enable page_cgroup-to-page
lookups without a direct pointer.

This saves a full word for every page in a system with memory cgroups
enabled.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23 19:46:28 -07:00
Johannes Weiner
de3638d9cd memcg: fold __mem_cgroup_move_account into caller
It is one logical function, no need to have it split up.

Also, get rid of some checks from the inner function that ensured the
sanity of the outer function.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23 19:46:27 -07:00
Johannes Weiner
97a6c37b34 memcg: change page_cgroup_zoneinfo signature
Instead of passing a whole struct page_cgroup to this function, let it
take only what it really needs from it: the struct mem_cgroup and the
page.

This has the advantage that reading pc->mem_cgroup is now done at the same
place where the ordering rules for this pointer are enforced and
explained.

It is also in preparation for removing the pc->page backpointer.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23 19:46:26 -07:00
Daisuke Nishimura
f212ad7cf9 memcg: add memcg sanity checks at allocating and freeing pages
Add checks at allocating or freeing a page whether the page is used (iow,
charged) from the view point of memcg.

This check may be useful in debugging a problem and we did similar checks
before the commit 52d4b9ac(memcg: allocate all page_cgroup at boot).

This patch adds some overheads at allocating or freeing memory, so it's
enabled only when CONFIG_DEBUG_VM is enabled.

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23 19:46:25 -07:00
Johannes Weiner
9d11ea9f16 memcg: simplify the way memory limits are checked
Since transparent huge pages, checking whether memory cgroups are below
their limits is no longer enough, but the actual amount of chargeable
space is important.

To not have more than one limit-checking interface, replace
memory_cgroup_check_under_limit() and memory_cgroup_check_margin() with a
single memory_cgroup_margin() that returns the chargeable space and leaves
the comparison to the callsite.

Soft limits are now checked the other way round, by using the already
existing function that returns the amount by which soft limits are
exceeded: res_counter_soft_limit_excess().

Also remove all the corresponding functions on the res_counter side that
are now no longer used.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23 19:46:23 -07:00
Johannes Weiner
b7c6167848 memcg: soft limit reclaim should end at limit not below
Soft limit reclaim continues until the usage is below the current soft
limit, but the documented semantics are actually that soft limit reclaim
will push usage back until the soft limits are met again.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23 19:46:23 -07:00
Akinobu Mita
b9b9144a53 reiserfs: use little-endian bitops
As a preparation for removing ext2 non-atomic bit operations from
asm/bitops.h.  This converts ext2 non-atomic bit operations to
little-endian bit operations.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23 19:46:18 -07:00
Akinobu Mita
0795ccea24 ext3: use little-endian bitops
As a preparation for removing ext2 non-atomic bit operations from
asm/bitops.h.  This converts ext2 non-atomic bit operations to
little-endian bit operations.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Acked-by: Jan Kara <jack@suse.cz>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23 19:46:17 -07:00
Rafael J. Wysocki
d47d81c0e9 Introduce ARCH_NO_SYSDEV_OPS config option (v2)
Introduce Kconfig option allowing architectures where sysdev
operations used during system suspend, resume and shutdown have been
completely replaced with struct sycore_ops operations to avoid
building sysdev code that will never be used.

Make callbacks in struct sys_device and struct sysdev_driver depend
on ARCH_NO_SYSDEV_OPS to allows us to verify if all of the references
have been actually removed from the code the given architecture
depends on.

Make x86 select ARCH_NO_SYSDEV_OPS.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2011-03-23 22:16:41 +01:00
Grant Likely
1eed4c077c dt: eliminate OF_NO_DEEP_PROBE and test for NULL match table
There are no users of OF_NO_DEEP_PROBE, and of_match_node() now
gracefully handles being passed a NULL pointer, so the checks at the
top of of_platform_bus_probe can be dropped.

While at it, consolidate the root node pointer check to be easier to
read and tidy up related comments.

Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
2011-03-23 14:55:56 -06:00
Stephen Wilson
5ddd36b9c5 mm: implement access_remote_vm
Provide an alternative to access_process_vm that allows the caller to obtain a
reference to the supplied mm_struct.

Signed-off-by: Stephen Wilson <wilsons@start.ca>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-23 16:36:57 -04:00
Stephen Wilson
cae5d39032 mm: arch: rename in_gate_area_no_task to in_gate_area_no_mm
Now that gate vma's are referenced with respect to a particular mm and not a
particular task it only makes sense to propagate the change to this predicate as
well.

Signed-off-by: Stephen Wilson <wilsons@start.ca>
Reviewed-by: Michel Lespinasse <walken@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-23 16:36:55 -04:00
Stephen Wilson
83b964bbf8 mm: arch: make in_gate_area take an mm_struct instead of a task_struct
Morally, the question of whether an address lies in a gate vma should be asked
with respect to an mm, not a particular task.  Moreover, dropping the dependency
on task_struct will help make existing and future operations on mm's more
flexible and convenient.

Signed-off-by: Stephen Wilson <wilsons@start.ca>
Reviewed-by: Michel Lespinasse <walken@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-23 16:36:54 -04:00
Stephen Wilson
31db58b3ab mm: arch: make get_gate_vma take an mm_struct instead of a task_struct
Morally, the presence of a gate vma is more an attribute of a particular mm than
a particular task.  Moreover, dropping the dependency on task_struct will help
make both existing and future operations on mm's more flexible and convenient.

Signed-off-by: Stephen Wilson <wilsons@start.ca>
Reviewed-by: Michel Lespinasse <walken@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-23 16:36:54 -04:00
Andy Adamson
863a3c6c68 NFSv4.1: layoutcommit
The filelayout driver sends LAYOUTCOMMIT only when COMMIT goes to
the data server (as opposed to the MDS) and the data server WRITE
is not NFS_FILE_SYNC.

Only whole file layout support means that there is only one IOMODE_RW layout
segment.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Alexandros Batsakis <batsakis@netapp.com>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Dean Hildebrand <dhildeb@us.ibm.com>
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Mingyang Guo <guomingyang@nrchpc.ac.cn>
Signed-off-by: Tao Guo <guotao@nrchpc.ac.cn>
Signed-off-by: Zhang Jingwang <zhangjingwang@nrchpc.ac.cn>
Tested-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-03-23 15:29:04 -04:00
Fred Isaman
e0c2b38018 NFSv4.1: filelayout driver specific code for COMMIT
Implement all the hooks created in the previous patches.
This requires exporting quite a few functions and adding a few
structure fields.

Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-03-23 15:29:04 -04:00
Fred Isaman
a861a1e1c3 NFSv4.1: add generic layer hooks for pnfs COMMIT
We create three major hooks for the pnfs code.

pnfs_mark_request_commit() is called during writeback_done from
nfs_mark_request_commit, which gives the driver an opportunity to
claim it wants control over commiting a particular req.

pnfs_choose_commit_list() is called from nfs_scan_list
to choose which list a given req should be added to, based on
where we intend to send it for COMMIT.  It is up to the driver
to have preallocated list headers for each destination it may need.

pnfs_commit_list() is how the driver actually takes control, it is
used instead of nfs_commit_list().

In order to pass information between the above functions, we create
a union in nfs_page to hold a lseg (which is possible because the req is
not on any list while in transition), and add some flags to indicate
if we need to use the pnfs code.

Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-03-23 15:29:03 -04:00
Eli Cohen
c1b43dca13 mlx4: Add blue flame support for kernel consumers
Using blue flame can improve latency by allowing the HW to more efficiently
access the WQE. This patch presents two functions that are used to allocate or
release HW resources for using blue flame; the caller need to supply a struct
mlx4_bf object when allocating resources. Consumers that make use of this API
should post doorbells to the UAR object pointed by the initialized struct
mlx4_bf;

Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-23 12:24:23 -07:00
Yevgeny Petrilin
1679200f91 mlx4_en: Enabling new steering
The mlx4_en module now uses the new steering mechanism.
The RX packets are now steered through the MCG table instead
of Mac table for unicast, and default entry for multicast.
The feature is enabled through INIT_HCA

Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-23 12:24:22 -07:00
Yevgeny Petrilin
0345584e0b mlx4: generalization of multicast steering.
The same packet steering mechanism would be used both for IB and Ethernet,
Both multicasts and unicasts.
This commit prepares the general infrastructure for this.

Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-23 12:24:21 -07:00
Yevgeny Petrilin
725c89997e mlx4_en: Reporting HW revision in ethtool -i
HW revision is derived from device ID and rev id.

Signed-off-by: Eugenia Emantayev <eugenia@mellanox.co.il>
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-23 12:24:20 -07:00
Yevgeny Petrilin
14c07b1358 mlx4: Wake on LAN support
The driver queries the FW for WOL support.
Ethtool get/set_wol is implemented accordingly.
Only magic packets are supported at the time.

Signed-off-by: Igor Yarovinsky <igory@mellanox.co.il>
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-23 12:24:19 -07:00
Yevgeny Petrilin
0b7ca5a928 mlx4: Changing interrupt scheme
Adding a pool of MSI-X vectors and EQs that can be used explicitly by mlx4_core
customers (mlx4_ib, mlx4_en). The consumers will assign their own names to the
interrupt vectors. Those vectors are not opened at mlx4 device initialization,
opened by demand.
Changed the max number of possible EQs according to the new scheme, no longer relies on
on number of cores.
The new functionality is exposed through mlx4_assign_eq() and mlx4_release_eq().
Customers that do not use the new API will get completion vectors as before.

Signed-off-by: Markuze Alex <markuze@mellanox.co.il>
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-23 12:24:18 -07:00
Thomas Gleixner
a2e8461a2c genirq: Provide locked setter for chip, handler, name
Some irq_set_type() callbacks need to change the chip and the handler
when the trigger mode changes. We have already a (misnomed) setter
function for the handler which can be called from irq_set_type().

Provide one which allows to set chip and name as well. Put the
misnomed function under the COMPAT switch and provide a replacement.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-03-23 20:22:06 +01:00
Thomas Gleixner
d3e17deb17 genirq: Provide a lockdep helper
Some irq chips need to call genirq functions for nested chips from
their callbacks. That upsets lockdep. So they need to set a different
lock class for those nested chips. Provide a helper function to avoid
open access to irq_desc.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-03-23 20:22:06 +01:00
Thomas Gleixner
3b90389128 genirq; Remove the last leftovers of the old sparse irq code
All users converted. Get rid of it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-03-23 20:22:06 +01:00
Bryan Schumaker
8ef2ce3e16 NFS: Detect loops in a readdir due to bad cookies
Some filesystems (such as ext4) can return the same cookie value for
multiple files.  If we try to start a readdir with one of these cookies,
the server will return the first file found with a cookie of the same
value.  This can cause the client to enter an infinite loop.

Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-03-23 15:14:27 -04:00
Bryan Schumaker
480c2006eb NFS: Create nfs_open_dir_context
nfs_opendir() created a context that held much more information than we
need for a readdir.  This patch introduces a slimmed-down
nfs_open_dir_context that contains only the cookie and the cred used for
RPC operations.  The new context will eventually be used to help detect
readdir loops.

Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-03-23 15:13:11 -04:00
Stephane Eranian
68cacd2916 perf_events: Fix stale ->cgrp pointer in update_cgrp_time_from_cpuctx()
This patch solves a stale pointer problem in
update_cgrp_time_from_cpuctx(). The cpuctx->cgrp
was not cleared on all possible event exit paths,
including:

   close()
     perf_release()
       perf_release_kernel()
         list_del_event()

This patch fixes list_del_event() to clear cpuctx->cgrp
when there are no cgroup events left in the context.

[ This second version makes the code compile when
  CONFIG_CGROUP_PERF is not enabled. We unconditionally define
  perf_cpu_context->cgrp. ]

Signed-off-by: Stephane Eranian <eranian@google.com>
Cc: peterz@infradead.org
Cc: perfmon2-devel@lists.sf.net
Cc: paulus@samba.org
Cc: davem@davemloft.net
LKML-Reference: <20110323150306.GA1580@quad>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-03-23 16:07:22 +01:00
Heiko Carstens
04948c7f80 smp: add missing init.h include
Commit 34db18a054 ("smp: move smp setup functions to kernel/smp.c")
causes this build error on s390 because of a missing init.h include:

  CC      arch/s390/kernel/asm-offsets.s
  In file included from /home2/heicarst/linux-2.6/arch/s390/include/asm/spinlock.h:14:0,
  from include/linux/spinlock.h:87,
  from include/linux/seqlock.h:29,
  from include/linux/time.h:8,
  from include/linux/timex.h:56,
  from include/linux/sched.h:57,
  from arch/s390/kernel/asm-offsets.c:10:
  include/linux/smp.h:117:20: error: expected '=', ',', ';', 'asm' or '__attribute__' before 'setup_nr_cpu_ids'
  include/linux/smp.h:118:20: error: expected '=', ',', ';', 'asm' or '__attribute__' before 'smp_init'

Fix it by adding the include statement.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: WANG Cong <amwang@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23 07:48:42 -07:00
Jonathan Neuschäfer
e815f0a84f sched.h: Fix a typo ("its")
The sentence uses the possessive pronoun, which is spelled
without an apostrophe.

Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net>
Cc: Jiri Kosina <trivial@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <1300735487-2406-1-git-send-email-j.neuschaefer@gmx.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-03-23 12:10:02 +01:00
Linus Walleij
8bd4d7c4c5 mfd: Rename ab8500 gpadc header
Rename AB8500 GPADC header so as not to be redunantly named.

Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2011-03-23 10:42:11 +01:00
Mark Brown
07e73fbb2d mfd: Constify WM8994 write path
Allow const buffers to be passed in without type safety issues.

Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2011-03-23 10:42:11 +01:00
Linus Walleij
2edd3b6924 regulator: Add a subdriver for TI TPS6105x regulator portions v2
This adds a subdriver for the regulator found inside the TPS61050
and TPS61052 chips.

Cc: Samuel Ortiz <samuel.ortiz@intel.com>
Cc: Ola Lilja <ola.o.lilja@stericsson.com>
Cc: Jonas Aberg <jonas.aberg@stericsson.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Acked-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2011-03-23 10:42:09 +01:00
Linus Walleij
798a8eee44 mfd: Add a core driver for TI TPS61050/TPS61052 chips v2
The TPS61050/TPS61052 are boost converters, LED drivers, LED flash
drivers and a simple GPIO pin chips.

Cc: Liam Girdwood <lrg@slimlogic.co.uk>
Cc: Mark Brown <broonie@opensource.wolfsonmicro.com>
Cc: Jonas Aberg <jonas.aberg@stericsson.com>
Cc: Ola Lilja <ola.o.lilja@stericsson.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2011-03-23 10:42:09 +01:00
Denis Turischev
c4fdd1163a pci_ids: Add Intel Tunnel Creek LPC Bridge device ID.
Signed-off-by: Denis Turischev <denis@compulab.co.il>
Acked-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2011-03-23 10:42:08 +01:00
MyungJoo Ham
bd6ca2cf50 regulator: MAX8997/8966 support
This patch supports PMIC/Regulator part of MAX8997/MAX8966 MFD.
In this initial release, selecting voltages or current-limit
and switching on/off the regulators are supported.

Controlling voltages for DVS with GPIOs is not implemented fully
and requires more considerations: it controls multiple bucks (selection
of 1, 2, and 5) at the same time with SET1~3 gpios. Thus, when DVS-GPIO
is activated, we lose the ability to control the voltage of a single
buck regulator independently; i.e., contolling a buck affects other two
bucks. Therefore, using the conventional regulator framework directly
might be problematic. However, in this driver, we try to choose
a setting without such side effect of affecting other regulators and
then try to choose a setting with the minimum side effect (the sum of
voltage changes in other regulators).

On the other hand, controlling all the three bucks simultenously based
on the voltage set table may help build cpufreq and similar system
more robust; i.e., all the three voltages are consistent every time
without glitches during transition.

Signed-off-by: MyungJoo Ham <myungjoo.ham@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Acked-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2011-03-23 10:42:07 +01:00
Mark Brown
e93c53870c mfd: Add WM8994 bulk register write operation
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2011-03-23 10:42:07 +01:00
Haojian Zhuang
09b034191a mfd: Append additional read write on 88pm860x
Append the additional read/write operation on 88pm860x for accessing
test page in 88PM860x.

Signed-off-by: Haojian Zhuang <haojian.zhuang@marvell.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2011-03-23 10:42:07 +01:00
Haojian Zhuang
22aad0011e mfd: Adopt mfd_data in 88pm860x regulator
Copy 88pm860x platform data into different mfd_data structure for
regulator driver. So move the identification of device node from
regulator driver to mfd driver.

Signed-off-by: Haojian Zhuang <haojian.zhuang@marvell.com>
Acked-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2011-03-23 10:42:06 +01:00
Haojian Zhuang
3154c34469 mfd: Adopt mfd_data in 88pm860x led
Copy 88pm860x platform data into different mfd_data structure for
led driver. So move the identification of device node from led
driver to mfd driver.

Signed-off-by: Haojian Zhuang <haojian.zhuang@marvell.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2011-03-23 10:42:06 +01:00
Haojian Zhuang
adb70483f4 mfd: Adopt mfd_data in 88pm860x backlight
Copy 88pm860x platform data into different mfd_data structure for
backlight driver. So move the identification of device node from
backlight driver to mfd driver.

Signed-off-by: Haojian Zhuang <haojian.zhuang@marvell.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2011-03-23 10:42:05 +01:00
Daniel Willerud
6321992cd3 mfd: Reentrance and revamp ab8500 gpadc fetching interface
This revamps the interface so that AB8500 GPADCs are fetched by
name. Probed GPADCs are added to a list and this list is searched
for a matching GPADC. This makes it possible to have multiple
AB8500 GPADC instances instead of it being a singleton, and
rids the need to keep a GPADC pointer around in the core AB8500
MFD struct.

Currently the match is made to the device name which is by default
numbered from the device instance such as "ab8500-gpadc.0" but
by using the .init_name field of the device a more intiutive
naming for the GPADC blocks can be achieved if desired.

Signed-off-by: Daniel Willerud <daniel.willerud@stericsson.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2011-03-23 10:42:04 +01:00
Daniel Willerud
cf16943947 mfd: Move ab8500 gpadc header to subdir
This moves the ab8500-gpadc.h header down into the ab8500/
subdir in include/linux/mfd and fixes some whitespace in the
header in the process.

Signed-off-by: Daniel Willerud <daniel.willerud@stericsson.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2011-03-23 10:42:03 +01:00
MyungJoo Ham
527e7e9a82 mfd: MAX8997/8966 support
MAX8997/MAX8966 chip is a multi-function device with I2C bussses. The
chip includes PMIC, RTC, Fuel Gauge, MUIC, Haptic, Flash control, and
Battery (charging) control.

This patch is an initial release of a MAX8997/8966 driver that supports
to enable the chip with its primary I2C bus that connects every device
mentioned above except for Fuel Gauge, which uses another I2C bus. The
fuel gauge is not supported by this mfd driver and is supported by a
seperated driver of MAX17042 Fuel Gauge (yes, the fuel gauge part is
compatible with MAX17042).

Signed-off-by: MyungJoo Ham <myungjoo.ham@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Reviewed-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2011-03-23 10:42:03 +01:00
Andres Salomon
f77289ac25 mfd: Rename mfd_shared_cell_{en,dis}able to drop the "shared" part
As requested by Samuel, there's not really any reason to have "shared"
in the name.

This also modifies the only user of the function, as well.

Signed-off-by: Andres Salomon <dilinger@queued.net>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2011-03-23 10:42:03 +01:00
Mattias Wallin
adceed6263 mfd: ab8500 chip revision 3.0 support
This patch adds support for ab8500 chip revision cut 3.0.
Also rephrased from Changes to Author in the header.

Signed-off-by: Mattias Wallin <mattias.wallin@stericsson.com>
Acked-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2011-03-23 10:42:02 +01:00
Mark Brown
93619c2106 mfd: Add platform data to support multiple WM831x devices per board
If a system contains multiple WM831x devices we need to pass a device
number through to the MFD so that we use unique device IDs when we
instantiate child devices. In order to get support for this into 2.6.39
add some platform data to support the configuration, but no implementation
as yet.

Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2011-03-23 10:42:01 +01:00
Keerthy
f99c1d4f94 mfd: Add twl4030 madc driver
Introducing a driver for MADC on TWL4030 powerIC. MADC stands for monitoring
ADC. This driver monitors the real time conversion of analog signals like
battery temperature, battery cuurent etc.

Signed-off-by: Keerthy <j-keerthy@ti.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2011-03-23 10:42:00 +01:00
Andres Salomon
a9bbba9963 mfd: add platform_device sharing support for mfd
This adds functions to enable platform_device sharing for mfd clients.

Each platform driver (mfd client) that wants to share an mfd_cell's
platform_device uses the mfd_shared_platform_driver_{un,}register()
functions instead of platform_driver_{un,}register().  Along with
registering the platform driver, these also register a new platform
device with the same characteristics as the original cell, but a different
name.  Given an mfd_cell with the name "foo", drivers that want to
share access to its resources can call mfd_shared_platform_driver_register
with platform drivers named (for example) "bar" and "baz".  This
will register two platform devices and drivers named "bar" and "baz"
that share the same cell as the platform device "foo".  The drivers
can then call "foo" cell's enable hooks (or mfd_shared_cell_enable)
to enable resources, and obtain platform resources as they normally
would.

This deals with platform handling only; mfd driver-specific details,
hardware handling, refcounting, etc are all dealt with separately.

Signed-off-by: Andres Salomon <dilinger@queued.net>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2011-03-23 10:41:58 +01:00
Andres Salomon
1e29af62f2 mfd: Add refcounting support to mfd_cells
This provides convenience functions for sharing of cells across
multiple mfd clients.  Mfd drivers can provide enable/disable hooks
to actually tweak the hardware, and clients can call
mfd_shared_cell_{en,dis}able without having to worry about whether
or not another client happens to have enabled or disabled the
cell/hardware.

Note that this is purely optional; drivers can continue to use
the mfd_cell's enable/disable hooks for their own purposes, if
desired.

Signed-off-by: Andres Salomon <dilinger@queued.net>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2011-03-23 10:41:58 +01:00
Andres Salomon
dcb50e83bb mfd: Remove driver_data field from mfd_cell
All users of this have now been switched over to using mfd_data;
it can go away now.

Signed-off-by: Andres Salomon <dilinger@queued.net>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2011-03-23 10:41:58 +01:00
Andres Salomon
65e523595a mfd: Rename platform_data field of mfd_cell to mfd_data
Rename the platform_data variable to imply a distinction between
common platform_data driver usage (typically accessed via
pdev->dev.platform_data) and the way MFD passes data down to
clients (using a wrapper named mfd_get_data).

All clients have already been changed to use the wrapper function,
so this can be a quick single-commit change that only touches things
in drivers/mfd.

Signed-off-by: Andres Salomon <dilinger@queued.net>
Acked-by: Linus Walleij <linus.walleij@stericsson.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2011-03-23 10:41:55 +01:00
Andres Salomon
40e03f571b mfd: Drop data_size from mfd_cell struct
Now that there are no more users of this, drop it.

Signed-off-by: Andres Salomon <dilinger@queued.net>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2011-03-23 10:41:55 +01:00
Andres Salomon
4ec1b54c4d mfd: mfd_cell is now implicitly available to mc13xxx drivers
The cell's platform_data is now accessed with a helper function;
change clients to use that, and remove the now-unused data_size.

Note that mfd-core no longer makes a copy of platform_data, but the
mc13xxx-core driver creates the pdata structures on the stack.  In
order to get around that, the various ARM mach types that set the
pdata have been changed to hold the variable in static (global) memory.
Also note that __initdata references in aforementioned pdata structs
have been dropped.

Signed-off-by: Andres Salomon <dilinger@queued.net>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2011-03-23 10:41:54 +01:00
Andres Salomon
0ce5fabe59 mfd: mfd_cell is now implicitly available to ab3550 driver
No clients (in mainline kernel, I'm told that drivers exist in external
trees that are planned for mainline inclusion) make use of this, nor
do they make use of platform_data, so nothing really had to change here.

The .data_size field is unused, so its usage gets removed.

Signed-off-by: Andres Salomon <dilinger@queued.net>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2011-03-23 10:41:51 +01:00
Andres Salomon
fe891a008f mfd-core: Unconditionally add mfd_cell to every platform_device
Previously, one would set the mfd_cell's platform_data/data_size to point
to the current mfd_cell in order to pass that information along to drivers.

This causes the current mfd_cell to always be available to drivers.  It
also adds a wrapper function for fetching the mfd cell from a platform
device, similar to what originally existed for mfd devices.

Drivers who previously used platform_data for other purposes can still
use it; the difference is that mfd_get_data() must be used to
access it (and the pdata structure is no longer allocated in
mfd_add_devices).

Note that mfd_get_data is intentionally vague (in name) about where
the data is stored; variable name changes can come later without having
to touch brazillions of drivers.

Signed-off-by: Andres Salomon <dilinger@queued.net>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2011-03-23 10:41:50 +01:00
Andres Salomon
2798e226ad mfd-core: Fix up typos/vagueness in comment
Signed-off-by: Andres Salomon <dilinger@queued.net>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2011-03-23 10:41:50 +01:00
Balaji T K
8e6de4a302 regulator: twl: add clk32kg to twl-regulator
In OMAP4 Blaze and Panda, 32KHz clock to WLAN is supplied from Phoenix
TWL6030. The 32KHz clock state (ON/OFF) is configured in
CLK32KG_CFG_[GRP, TRANS, STATE] register. This follows the same register
programming model as other regulators in TWL6030. So add CLK32KG as pseudo
regulator.

Signed-off-by: Balaji T K <balajitk@ti.com>
Acked-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2011-03-23 10:41:48 +01:00
Arun Murthy
dae2db30c1 mfd: Add new ab8500 GPADC driver
AB8500 GPADC driver used to convert Acc and battery/ac/usb voltage

Signed-off-by: Arun Murthy <arun.murthy@stericsson.com>
Acked-by: Linus Walleij <linus.walleij@stericsson.com>
Acked-by: Mattias Wallin <mattias.wallin@stericsson.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2011-03-23 10:41:48 +01:00
Mattias Nilsson
90550d1903 mfd: AB8500 system control driver
This adds a pretty straight-forward system control driver for the
AB8500. This driver will be used from the core platform, e.g the
clock tree implementation in the machine code, and is by nature
singleton.

There are a few simple functions to read, write, set and clear
registers so that the machine code can control its own foundation.

Cc: Mattias Wallin <mattias.wallin@stericsson.com>
Signed-off-by: Mattias Nilsson <mattias.i.nilsson@stericsson.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2011-03-23 10:41:47 +01:00
Mark Brown
b103e0b3c5 mfd: Support configuration of WM831x /IRQ output in CMOS mode
Provide platform data allowing the system to set the /IRQ pin into
CMOS mode rather than the default open drain. The default value of
this platform data reflects the default hardware configuration so
there should be no change to existing users.

Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2011-03-23 10:41:44 +01:00
Len Brown
02e2407858 Merge branch 'linus' into release
Conflicts:
	arch/x86/kernel/acpi/sleep.c

Signed-off-by: Len Brown <len.brown@intel.com>
2011-03-23 02:34:54 -04:00
Len Brown
5c129a8600 Merge commit 'v2.6.38' into release 2011-03-23 02:33:54 -04:00
Sriram
6a1fef6d00 net: davinci_emac:Fix translation logic for buffer descriptor
With recent changes to the driver(switch to new cpdma layer),
the support for buffer descriptor address translation logic
is broken. This affects platforms where the physical address of
the descriptors as seen by the DMA engine is different from the
physical address.

Original Patch adding translation logic support:
Commit: ad021ae886

Signed-off-by: Sriramakrishnan A G <srk@ti.com>
Tested-By: Sekhar Nori <nsekhar@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-22 19:25:05 -07:00
Linus Torvalds
6447f55da9 Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/async_tx
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/async_tx: (66 commits)
  avr32: at32ap700x: fix typo in DMA master configuration
  dmaengine/dmatest: Pass timeout via module params
  dma: let IMX_DMA depend on IMX_HAVE_DMA_V1 instead of an explicit list of SoCs
  fsldma: make halt behave nicely on all supported controllers
  fsldma: reduce locking during descriptor cleanup
  fsldma: support async_tx dependencies and automatic unmapping
  fsldma: fix controller lockups
  fsldma: minor codingstyle and consistency fixes
  fsldma: improve link descriptor debugging
  fsldma: use channel name in printk output
  fsldma: move related helper functions near each other
  dmatest: fix automatic buffer unmap type
  drivers, pch_dma: Fix warning when CONFIG_PM=n.
  dmaengine/dw_dmac fix: use readl & writel instead of __raw_readl & __raw_writel
  avr32: at32ap700x: Specify DMA Flow Controller, Src and Dst msize
  dw_dmac: Setting Default Burst length for transfers as 16.
  dw_dmac: Allow src/dst msize & flow controller to be configured at runtime
  dw_dmac: Changing type of src_master and dest_master to u8.
  dw_dmac: Pass Channel Priority from platform_data
  dw_dmac: Pass Channel Allocation Order from platform_data
  ...
2011-03-22 17:53:13 -07:00
Jim Keniston
565d76cb7d zlib: slim down zlib_deflate() workspace when possible
Instead of always creating a huge (268K) deflate_workspace with the
maximum compression parameters (windowBits=15, memLevel=8), allow the
caller to obtain a smaller workspace by specifying smaller parameter
values.

For example, when capturing oops and panic reports to a medium with
limited capacity, such as NVRAM, compression may be the only way to
capture the whole report.  In this case, a small workspace (24K works
fine) is a win, whether you allocate the workspace when you need it (i.e.,
during an oops or panic) or at boot time.

I've verified that this patch works with all accepted values of windowBits
(positive and negative), memLevel, and compression level.

Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: David Miller <davem@davemloft.net>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:17 -07:00
Konstantin Khlebnikov
d03e1617f0 crc32: add missed brackets in macro
Add brackets around typecasted argument in crc32() macro.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:15 -07:00
Mike Frysinger
e359dc24d3 sigma-firmware: loader for Analog Devices' SigmaStudio
Analog Devices' SigmaStudio can produce firmware blobs for devices with
these DSPs embedded (like some audio codecs).  Allow these device drivers
to easily parse and load them.

Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:15 -07:00
Alexey Dobriyan
33ee3b2e2e kstrto*: converting strings to integers done (hopefully) right
1. simple_strto*() do not contain overflow checks and crufty,
   libc way to indicate failure.
2. strict_strto*() also do not have overflow checks but the name and
   comments pretend they do.
3. Both families have only "long long" and "long" variants,
   but users want strtou8()
4. Both "simple" and "strict" prefixes are wrong:
   Simple doesn't exactly say what's so simple, strict should not exist
   because conversion should be strict by default.

The solution is to use "k" prefix and add convertors for more types.
Enter
	kstrtoull()
	kstrtoll()
	kstrtoul()
	kstrtol()
	kstrtouint()
	kstrtoint()

	kstrtou64()
	kstrtos64()
	kstrtou32()
	kstrtos32()
	kstrtou16()
	kstrtos16()
	kstrtou8()
	kstrtos8()

Include runtime testsuite (somewhat incomplete) as well.

strict_strto*() become deprecated, stubbed to kstrto*() and
eventually will be removed altogether.

Use kstrto*() in code today!

Note: on some archs _kstrtoul() and _kstrtol() are left in tree, even if
      they'll be unused at runtime. This is temporarily solution,
      because I don't want to hardcode list of archs where these
      functions aren't needed. Current solution with sizeof() and
      __alignof__ at least always works.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:14 -07:00
Amerigo Wang
34db18a054 smp: move smp setup functions to kernel/smp.c
Move setup_nr_cpu_ids(), smp_init() and some other SMP boot parameter
setup functions from init/main.c to kenrel/smp.c, saves some #ifdef
CONFIG_SMP.

Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: Rakib Mullick <rakib.mullick@gmail.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Tejun Heo <tj@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:11 -07:00
Uwe Kleine-König
fa9ee9c4b9 include/linux/err.h: add a function to cast error-pointers to a return value
PTR_RET() can be used if you have an error-pointer and are only interested
in the eventual error value, but not the pointer.  Yields the usual 0 for
no error, -ESOMETHING otherwise.

Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:11 -07:00
Richard Kennedy
f3ccfcdaf3 fs.h: remove 8 bytes of padding from block_device on 64bit builds
Re-ordering struct block_inode to remove 8 bytes of padding on 64 bit
builds, which also shrinks bdev_inode by 8 bytes (776 -> 768) allowing it
to fit into one fewer cache lines.

Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:10 -07:00
Borislav Petkov
c837fb37a6 include/linux/compiler-gcc*.h: unify macro definitions
Unify identical gcc3.x and gcc4.x macros.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:10 -07:00
FUJITA Tomonori
3e50594e8e add the common dma_addr_t typedef to include/linux/types.h
All architectures can use the common dma_addr_t typedef now. We can
remove the arch specific dma_addr_t.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Matt Turner <mattst88@gmail.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:09 -07:00
Andi Kleen
78afd5612d mm: add __GFP_OTHER_NODE flag
Add a new __GFP_OTHER_NODE flag to tell the low level numa statistics in
zone_statistics() that an allocation is on behalf of another thread.  This
way the local and remote counters can be still correct, even when
background daemons like khugepaged are changing memory mappings.

This only affects the accounting, but I think it's worth doing that right
to avoid confusing users.

I first tried to just pass down the right node, but this required a lot of
changes to pass down this parameter and at least one addition of a 10th
argument to a 9 argument function.  Using the flag is a lot less
intrusive.

Open: should be also used for migration?

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:05 -07:00
Mel Gorman
8afdcece49 mm: vmscan: kswapd should not free an excessive number of pages when balancing small zones
When reclaiming for order-0 pages, kswapd requires that all zones be
balanced.  Each cycle through balance_pgdat() does background ageing on
all zones if necessary and applies equal pressure on the inactive zone
unless a lot of pages are free already.

A "lot of free pages" is defined as a "balance gap" above the high
watermark which is currently 7*high_watermark.  Historically this was
reasonable as min_free_kbytes was small.  However, on systems using huge
pages, it is recommended that min_free_kbytes is higher and it is tuned
with hugeadm --set-recommended-min_free_kbytes.  With the introduction of
transparent huge page support, this recommended value is also applied.  On
X86-64 with 4G of memory, min_free_kbytes becomes 67584 so one would
expect around 68M of memory to be free.  The Normal zone is approximately
35000 pages so under even normal memory pressure such as copying a large
file, it gets exhausted quickly.  As it is getting exhausted, kswapd
applies pressure equally to all zones, including the DMA32 zone.  DMA32 is
approximately 700,000 pages with a high watermark of around 23,000 pages.
In this situation, kswapd will reclaim around (23000*8 where 8 is the high
watermark + balance gap of 7 * high watermark) pages or 718M of pages
before the zone is ignored.  What the user sees is that free memory far
higher than it should be.

To avoid an excessive number of pages being reclaimed from the larger
zones, explicitely defines the "balance gap" to be either 1% of the zone
or the low watermark for the zone, whichever is smaller.  While kswapd
will check all zones to apply pressure, it'll ignore zones that meets the
(high_wmark + balance_gap) watermark.

To test this, 80G were copied from a partition and the amount of memory
being used was recorded.  A comparison of a patch and unpatched kernel can
be seen at
http://www.csn.ul.ie/~mel/postings/minfree-20110222/memory-usage-hydra.ps
and shows that kswapd is not reclaiming as much memory with the patch
applied.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: "Chen, Tim C" <tim.c.chen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:04 -07:00
Dave Hansen
033193275b pagewalk: only split huge pages when necessary
Right now, if a mm_walk has either ->pte_entry or ->pmd_entry set, it will
unconditionally split any transparent huge pages it runs in to.  In
practice, that means that anyone doing a

	cat /proc/$pid/smaps

will unconditionally break down every huge page in the process and depend
on khugepaged to re-collapse it later.  This is fairly suboptimal.

This patch changes that behavior.  It teaches each ->pmd_entry handler
(there are five) that they must break down the THPs themselves.  Also, the
_generic_ code will never break down a THP unless a ->pte_entry handler is
actually set.

This means that the ->pmd_entry handlers can now choose to deal with THPs
without breaking them down.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Eric B Munson <emunson@mgebm.net>
Tested-by: Eric B Munson <emunson@mgebm.net>
Cc: Michael J Wolf <mjwolf@us.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:04 -07:00
Minchan Kim
3f58a82943 memcg: move memcg reclaimable page into tail of inactive list
The rotate_reclaimable_page function moves just written out pages, which
the VM wanted to reclaim, to the end of the inactive list.  That way the
VM will find those pages first next time it needs to free memory.

This patch applies the rule in memcg.  It can help to prevent unnecessary
working page eviction of memcg.

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:03 -07:00
Minchan Kim
315601809d mm: deactivate invalidated pages
Recently, there are reported problem about thrashing.
(http://marc.info/?l=rsync&m=128885034930933&w=2) It happens by backup
workloads(ex, nightly rsync).  That's because the workload makes just
use-once pages and touches pages twice.  It promotes the page into active
list so that it results in working set page eviction.

Some app developer want to support POSIX_FADV_NOREUSE.  But other OSes
don't support it, either.
(http://marc.info/?l=linux-mm&m=128928979512086&w=2)

By other approach, app developers use POSIX_FADV_DONTNEED.  But it has a
problem.  If kernel meets page is writing during invalidate_mapping_pages,
it can't work.  It makes for application programmer to use it since they
always have to sync data before calling fadivse(..POSIX_FADV_DONTNEED) to
make sure the pages could be discardable.  At last, they can't use
deferred write of kernel so that they could see performance loss.
(http://insights.oetiker.ch/linux/fadvise.html)

In fact, invalidation is very big hint to reclaimer.  It means we don't
use the page any more.  So let's move the writing page into inactive
list's head if we can't truncate it right now.

Why I move page to head of lru on this patch, Dirty/Writeback page would
be flushed sooner or later.  It can prevent writeout of pageout which is
less effective than flusher's writeout.

Originally, I reused lru_demote of Peter with some change so added his
Signed-off-by.

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Reported-by: Ben Gamari <bgamari.foss@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:03 -07:00
Richard Kennedy
481b4bb5e3 mm: mm_struct: remove 16 bytes of alignment padding on 64 bit builds
Reorder mm_struct to remove 16 bytes of alignment padding on 64 bit
builds.  On my config this shrinks mm_struct by enough to fit in one
fewer cache lines and allows more objects per slab in mm_struct
kmem_cache under SLUB.

slabinfo before patch :-
    Sizes (bytes)     Slabs
    --------------------------------
    Object :     848  Total  :       9
    SlabObj:     896  Full   :       2
    SlabSiz:   16384  Partial:       5
    Loss   :      48  CpuSlab:       2
    Align  :      64  Objects:      18

 slabinfo after :-
    Sizes (bytes)     Slabs
    --------------------------------
    Object :     832  Total  :       7
    SlabObj:     832  Full   :       2
    SlabSiz:   16384  Partial:       3
    Loss   :       0  CpuSlab:       2
    Align  :      64  Objects:      19

Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:03 -07:00
Michel Lespinasse
cb240452bf mm: remove unused TestSetPageLocked() interface
TestSetPageLocked() isn't being used anywhere.  Also, using it would
likely be an error, since the proper interface trylock_page() provides
stronger ordering guarantees.

Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:03 -07:00
Peter Zijlstra
01d8b20dec mm: simplify anon_vma refcounts
This patch changes the anon_vma refcount to be 0 when the object is free.
It does this by adding 1 ref to being in use in the anon_vma structure
(iow.  the anon_vma->head list is not empty).

This allows a simpler release scheme without having to check both the
refcount and the list as well as avoids taking a ref for each entry on the
list.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:03 -07:00
Peter Zijlstra
83813267c6 mm: move anon_vma ref out from under CONFIG_foo
We need the anon_vma refcount unconditionally to simplify the anon_vma
lifetime rules.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:03 -07:00
Peter Zijlstra
9e60109f12 mm: rename drop_anon_vma() to put_anon_vma()
The normal code pattern used in the kernel is: get/put.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:03 -07:00
Minchan Kim
e64a782fec mm: change __remove_from_page_cache()
Now we renamed remove_from_page_cache with delete_from_page_cache.  As
consistency of __remove_from_swap_cache and remove_from_swap_cache, we
change internal page cache handling function name, too.

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:02 -07:00
Minchan Kim
702cfbf93a mm: goodbye remove_from_page_cache()
Now delete_from_page_cache() replaces remove_from_page_cache().  So we
remove remove_from_page_cache so fs or something out of mainline will
notice it when compile time and can fix it.

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:02 -07:00
Minchan Kim
97cecb5a25 mm: introduce delete_from_page_cache()
Presently we increase the page refcount in add_to_page_cache() but don't
decrease it in remove_from_page_cache().  Such asymmetry adds confusion,
requiring that callers notice it and a comment explaining why they release
a page reference.  It's not a good API.

A long time ago, Hugh tried it (http://lkml.org/lkml/2004/10/24/140) but
gave up because reiser4's drop_page() had to unlock the page between
removing it from page cache and doing the page_cache_release().  But now
the situation is changed.  I think at least things in current mainline
don't have any obstacles.  The problem is for out-of-mainline filesystems
- if they have done such things as reiser4, this patch could be a problem
but they will discover this at compile time since we remove
remove_from_page_cache().

This patch:

This function works as just wrapper remove_from_page_cache().  The
difference is that it decreases page references in itself.  So caller have
to make sure it has a page reference before calling.

This patch is ready for removing remove_from_page_cache().

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Edward Shishkin <edward.shishkin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:02 -07:00
Miklos Szeredi
ef6a3c6311 mm: add replace_page_cache_page() function
This function basically does:

     remove_from_page_cache(old);
     page_cache_release(old);
     add_to_page_cache_locked(new);

Except it does this atomically, so there's no possibility for the "add" to
fail because of a race.

If memory cgroups are enabled, then the memory cgroup charge is also moved
from the old page to the new.

This function is currently used by fuse to move pages into the page cache
on read, instead of copying the page contents.

[minchan.kim@gmail.com: add freepage() hook to replace_page_cache_page()]
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:02 -07:00
Gleb Natapov
318b275fbc mm: allow GUP to fail instead of waiting on a page
GUP user may want to try to acquire a reference to a page if it is already
in memory, but not if IO, to bring it in, is needed.  For example KVM may
tell vcpu to schedule another guest process if current one is trying to
access swapped out page.  Meanwhile, the page will be swapped in and the
guest process, that depends on it, will be able to run again.

This patch adds FAULT_FLAG_RETRY_NOWAIT (suggested by Linus) and
FOLL_NOWAIT follow_page flags.  FAULT_FLAG_RETRY_NOWAIT, when used in
conjunction with VM_FAULT_ALLOW_RETRY, indicates to handle_mm_fault that
it shouldn't drop mmap_sem and wait on a page, but return VM_FAULT_RETRY
instead.

[akpm@linux-foundation.org: improve FOLL_NOWAIT comment]
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:02 -07:00
David Rientjes
ddd588b5dd oom: suppress nodes that are not allowed from meminfo on oom kill
The oom killer is extremely verbose for machines with a large number of
cpus and/or nodes.  This verbosity can often be harmful if it causes other
important messages to be scrolled from the kernel log and incurs a
signicant time delay, specifically for kernels with CONFIG_NODES_SHIFT >
8.

This patch causes only memory information to be displayed for nodes that
are allowed by current's cpuset when dumping the VM state.  Information
for all other nodes is irrelevant to the oom condition; we don't care if
there's an abundance of memory elsewhere if we can't access it.

This only affects the behavior of dumping memory information when an oom
is triggered.  Other dumps, such as for sysrq+m, still display the
unfiltered form when using the existing show_mem() interface.

Additionally, the per-cpu pageset statistics are extremely verbose in oom
killer output, so it is now suppressed.  This removes

	nodes_weight(current->mems_allowed) * (1 + nr_cpus)

lines from the oom killer output.

Callers may use __show_mem(SHOW_MEM_FILTER_NODES) to filter disallowed
nodes.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:01 -07:00
Eric Dumazet
207205a2ba kthread: NUMA aware kthread_create_on_node()
All kthreads being created from a single helper task, they all use memory
from a single node for their kernel stack and task struct.

This patch suite creates kthread_create_on_node(), adding a 'cpu' parameter
to parameters already used by kthread_create().

This parameter serves in allocating memory for the new kthread on its
memory node if possible.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: David Howells <dhowells@redhat.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:01 -07:00
Andrea Arcangeli
d527caf22e mm: compaction: prevent kswapd compacting memory to reduce CPU usage
This patch reverts 5a03b051 ("thp: use compaction in kswapd for GFP_ATOMIC
order > 0") due to reports stating that kswapd CPU usage was higher and
IRQs were being disabled more frequently.  This was reported at
http://www.spinics.net/linux/fedora/alsa-user/msg09885.html.

Without this patch applied, CPU usage by kswapd hovers around the 20% mark
according to the tester (Arthur Marsh:
http://www.spinics.net/linux/fedora/alsa-user/msg09899.html).  With this
patch applied, it's around 2%.

The problem is not related to THP which specifies __GFP_NO_KSWAPD but is
triggered by high-order allocations hitting the low watermark for their
order and waking kswapd on kernels with CONFIG_COMPACTION set.  The most
common trigger for this is network cards configured for jumbo frames but
it's also possible it'll be triggered by fork-heavy workloads (order-1)
and some wireless cards which depend on order-1 allocations.

The symptoms for the user will be high CPU usage by kswapd in low-memory
situations which could be confused with another writeback problem.  While
a patch like 5a03b051 may be reintroduced in the future, this patch plays
it safe for now and reverts it.

[mel@csn.ul.ie: Beefed up the changelog]
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reported-by: Arthur Marsh <arthur.marsh@internode.on.net>
Tested-by: Arthur Marsh <arthur.marsh@internode.on.net>
Cc: <stable@kernel.org>		[2.6.38.1]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:00 -07:00