Commit Graph

123 Commits

Author SHA1 Message Date
Nick Piggin 557ed1fa26 remove ZERO_PAGE
The commit b5810039a5 contains the note

  A last caveat: the ZERO_PAGE is now refcounted and managed with rmap
  (and thus mapcounted and count towards shared rss).  These writes to
  the struct page could cause excessive cacheline bouncing on big
  systems.  There are a number of ways this could be addressed if it is
  an issue.

And indeed this cacheline bouncing has shown up on large SGI systems.
There was a situation where an Altix system was essentially livelocked
tearing down ZERO_PAGE pagetables when an HPC app aborted during startup.
This situation can be avoided in userspace, but it does highlight the
potential scalability problem with refcounting ZERO_PAGE, and corner
cases where it can really hurt (we don't want the system to livelock!).

There are several broad ways to fix this problem:
1. add back some special casing to avoid refcounting ZERO_PAGE
2. per-node or per-cpu ZERO_PAGES
3. remove the ZERO_PAGE completely

I will argue for 3. The others should also fix the problem, but they
result in more complex code than does 3, with little or no real benefit
that I can see.

Why? Inserting a ZERO_PAGE for anonymous read faults appears to be a
false optimisation: if an application is performance critical, it would
not be doing many read faults of new memory, or at least it could be
expected to write to that memory soon afterwards. If cache or memory use
is critical, it should not be working with a significant number of
ZERO_PAGEs anyway (a more compact representation of zeroes should be
used).

As a sanity check -- mesuring on my desktop system, there are never many
mappings to the ZERO_PAGE (eg. 2 or 3), thus memory usage here should not
increase much without it.

When running a make -j4 kernel compile on my dual core system, there are
about 1,000 mappings to the ZERO_PAGE created per second, but about 1,000
ZERO_PAGE COW faults per second (less than 1 ZERO_PAGE mapping per second
is torn down without being COWed). So removing ZERO_PAGE will save 1,000
page faults per second when running kbuild, while keeping it only saves
less than 1 page clearing operation per second. 1 page clear is cheaper
than a thousand faults, presumably, so there isn't an obvious loss.

Neither the logical argument nor these basic tests give a guarantee of no
regressions. However, this is a reasonable opportunity to try to remove
the ZERO_PAGE from the pagefault path. If it is found to cause regressions,
we can reintroduce it and just avoid refcounting it.

The /dev/zero ZERO_PAGE usage and TLB tricks also get nuked.  I don't see
much use to them except on benchmarks.  All other users of ZERO_PAGE are
converted just to use ZERO_PAGE(0) for simplicity. We can look at
replacing them all and maybe ripping out ZERO_PAGE completely when we are
more satisfied with this solution.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus "snif" Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:53 -07:00
Kawai, Hidehiro ee78b0a61f coredump masking: ELF-FDPIC: enable core dump filtering
This patch enables core dump filtering for ELF-FDPIC-formatted core file.

Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:47 -07:00
Kawai, Hidehiro e2e00906a0 coredump masking: ELF-FDPIC: remove an unused argument
This patch removes an unused argument from elf_fdpic_dump_segments().

Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:47 -07:00
Ollie Wild b6a2fea393 mm: variable length argument support
Remove the arg+env limit of MAX_ARG_PAGES by copying the strings directly from
the old mm into the new mm.

We create the new mm before the binfmt code runs, and place the new stack at
the very top of the address space.  Once the binfmt code runs and figures out
where the stack should be, we move it downwards.

It is a bit peculiar in that we have one task with two mm's, one of which is
inactive.

[a.p.zijlstra@chello.nl: limit stack size]
Signed-off-by: Ollie Wild <aaw@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <linux-arch@vger.kernel.org>
Cc: Hugh Dickins <hugh@veritas.com>
[bunk@stusta.de: unexport bprm_mm_init]
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 10:04:45 -07:00
Randy Dunlap e63340ae6b header cleaning: don't include smp_lock.h when not used
Remove includes of <linux/smp_lock.h> where it is not used/needed.
Suggested by Al Viro.

Builds cleanly on x86_64, i386, alpha, ia64, powerpc, sparc,
sparc64, and arm (all 59 defconfigs).

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:07 -07:00
Brian Pomerantz 0322170260 [PATCH] fix page leak during core dump
When the dump cannot occur most likely because of a full file system and
the page to be written is the zero page, the call to page_cache_release()
is missed.

Signed-off-by: Brian Pomerantz <bapper@mvista.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: David Howells <dhowells@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-02 10:06:08 -07:00
David Howells aa289b4723 [PATCH] FDPIC: fix the /proc/pid/stat representation of executable boundaries
Fix the /proc/pid/stat representation of executable boundaries.  It should
show the bounds of the executable, but instead shows the bounds of the
loader.

Before the patch is applied, the bug can be seen by examining, say, inetd:

	# ps | grep inetd
	  610         root          0   S   /usr/sbin/inetd -i
	# cat /proc/610/maps
	c0bb0000-c0bba788 r-xs 00000000 00:0b 14582157  /lib/ld-uClibc-0.9.28.so
	c3180000-c31dede4 r-xs 00000000 00:0b 14582179  /lib/libuClibc-0.9.28.so
	c328c000-c328ea00 rw-p 00008000 00:0b 14582157  /lib/ld-uClibc-0.9.28.so
	c3290000-c329b6c0 rw-p 00000000 00:00 0
	c32a0000-c32c0000 rwxp 00000000 00:00 0
	c32d4000-c32d8000 rw-p 00000000 00:00 0
	c3394000-c3398000 rw-p 00000000 00:00 0
	c3458000-c345f464 r-xs 00000000 00:0b 16384612  /usr/sbin/inetd
	c3470000-c34748f8 rw-p 00004000 00:0b 16384612  /usr/sbin/inetd
	c34cc000-c34d0000 rw-p 00000000 00:00 0
	c34d4000-c34d8000 rw-p 00000000 00:00 0
	c34d8000-c34dc000 rw-p 00000000 00:00 0
	# cat /proc/610/stat
	610 (inetd) S 1 610 610 0 -1 256 0 0 0 0 0 8 0 0 19 0 1 0 94392000718
	950272 0 4294967295 3233480704 3233523592 3274440352 3274439976
 	3273467584 0 0 4096 90115 3221712796 0 0 17 0 0 0 0

The code boundaries are 3233480704 to 3233523592, which are:

	(gdb) p/x 3233480704
	$1 = 0xc0bb0000
	(gdb) p/x 3233523592
	$2 = 0xc0bba788

Which corresponds to this line in the maps file:

	c0bb0000-c0bba788 r-xs 00000000 00:0b 14582157  /lib/ld-uClibc-0.9.28.so

Which is wrong.  After the patch is applied, the maps file is pretty much
identical (there's some minor shuffling of the location of some of the
anonymous VMAs), but the stat file is now:

	# cat /proc/610/stat
	610 (inetd) S 1 610 610 0 -1 256 0 0 0 0 0 7 0 0 18 0 1 0 94392000722
	950272 0 4294967295 3276111872 3276141668 3274440352 3274439976
	3273467584 0 0 4096 90115 3221712796 0 0 17 0 0 0 0

The code boundaries are then 3276111872 to 3276141668, which are:

	(gdb) p/x 3276111872
	$1 = 0xc3458000
	(gdb) p/x 3276141668
	$2 = 0xc345f464

And these correspond to this line in the maps file instead:

	c3458000-c345f464 r-xs 00000000 00:0b 16384612  /usr/sbin/inetd

Which is now correct.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-23 11:01:21 -07:00
Robert P. J. Day e10a4437cb [PATCH] Remove final references to deprecated "MAP_ANON" page protection flag
Remove the last vestiges of the long-deprecated "MAP_ANON" page protection
flag: use "MAP_ANONYMOUS" instead.

Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:17 -08:00
Alexey Dobriyan 1fb8449618 [PATCH] core-dumping unreadable binaries via PT_INTERP
Proposed patch to fix #5 in
http://www.isec.pl/vulnerabilities/isec-0017-binfmt_elf.txt
aka
http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2004-1073

To reproduce, do
* grab poc at the end of advisory.
* add line "eph.p_memsz = 4096;" after "eph.p_filesz = 4096;"
  where first "4096" is something equal to or greater than 4096.
* ./poc /usr/bin/sudo && ls -l

Here I get with 2.6.20-rc5:

 -rw------- 1 ad   ad   102400 2007-01-15 19:17 core
 ---s--x--x 2 root root 101820 2007-01-15 19:15 /usr/bin/sudo

Check for MAY_READ like binfmt_misc.c does.

Signed-off-by: Alexey Dobriyan <adobriyan@openvz.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-26 13:51:00 -08:00
Robert P. J. Day b87576d59b fs: Convert kmalloc() + memset() to kzalloc() in fs/.
Convert the single available instance of kmalloc() + memset() to
kzalloc() in the fs/ directory.

Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-12-12 20:07:35 +01:00
Cedric Le Goater 937949d9ed [PATCH] add process_session() helper routine
Replace occurences of task->signal->session by a new process_session() helper
routine.

It will be useful for pid namespaces to abstract the session pid number.

Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:51 -08:00
Josef "Jeff" Sipek 0f7fc9e4d0 [PATCH] VFS: change struct file to use struct path
This patch changes struct file to use struct path instead of having
independent pointers to struct dentry and struct vfsmount, and converts all
users of f_{dentry,vfsmnt} in fs/ to use f_path.{dentry,mnt}.

Additionally, it adds two #define's to make the transition easier for users of
the f_dentry and f_vfsmnt.

Signed-off-by: Josef "Jeff" Sipek <jsipek@cs.sunysb.edu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:41 -08:00
Magnus Damm 386d9a7edd [PATCH] elf: Always define elf_addr_t in linux/elf.h
Define elf_addr_t in linux/elf.h.  The size of the type is determined using
ELF_CLASS.  This allows us to remove the defines that today are spread all
over .c and .h files.

Signed-off-by: Magnus Damm <magnus@valinux.co.jp>
Cc: Daniel Jacobowitz <drow@false.org>
Cc: Roland McGrath <roland@redhat.com>
Cc: Jakub Jelinek <jakub@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:38 -08:00
Oleg Nesterov bce9a234ce [PATCH] elf_fdpic_core_dump: don't take tasklist_lock
do_each_thread() is rcu-safe, and all tasks which use this ->mm must sleep
in wait_for_completion(&mm->core_done) at this point, so we can use RCU
locks.

Also, remove unneeded INIT_LIST_HEAD(new) before list_add(new, head).

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-By: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:14 -07:00
David Howells 6d8c4e3b01 [PATCH] FDPIC: Add coredump capability for the ELF-FDPIC binfmt
Add coredump capability for the ELF-FDPIC binfmt.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10 13:24:22 -07:00
David Howells 8a2ab7f5df [PATCH] FDPIC: Adjust the ELF-FDPIC driver to conform more to the CodingStyle
Adjust the ELF-FDPIC binfmt driver to conform much more to the CodingStyle,
silly though it may be.

Further changes:

 (*) Drop the casts to long for addresses in kdebug() statements (they're
     unsigned long already).

 (*) Use extra variables to avoid expressions longer than 80 chars by splitting
     the statement into multiple statements and letting the compiler optimise
     them back together.

 (*) Eliminate duplicate call of ksize() when working out how much space was
     actually allocated for the stack.

 (*) Discard the commented-out load_shlib prototype and op pointer as this will
     not be supported in ELF-FDPIC for the foreseeable future.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10 13:24:21 -07:00
David Howells 1aeb21d626 [PATCH] FDPIC: Fix FDPIC compile errors
Fix FDPIC compile errors.

(akpm: we suspect it fixes a warning)

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10 13:24:21 -07:00
Al Viro 530018bf3d [PATCH] frv: binfmt_elf_fdpic __user annotations
Add __user annotations to binfmt_elf_fdpic.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:42:54 -07:00
Eric Sesterhenn 88bcd51262 BUG_ON() Conversion in fs/binfmt_elf_fdpic.c
this changes if() BUG(); constructs to BUG_ON() which is
cleaner and can better optimized away

Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-03-24 18:38:48 +01:00
Jesper Juhl 792db3af38 [PATCH] fs/binfmt_elf: Remove unneeded kmalloc() return value casts
Remove unneeded casts of kmalloc() return value in binfmt_elf.

Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10 08:02:01 -08:00
Jesper Juhl f99d49adf5 [PATCH] kfree cleanup: fs
This is the fs/ part of the big kfree cleanup patch.

Remove pointless checks for NULL prior to calling kfree() in fs/.

Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07 07:54:06 -08:00
Hugh Dickins 404351e67a [PATCH] mm: mm_init set_mm_counters
How is anon_rss initialized?  In dup_mmap, and by mm_alloc's memset; but
that's not so good if an mm_counter_t is a special type.  And how is rss
initialized?  By set_mm_counter, all over the place.  Come on, we just need to
initialize them both at once by set_mm_counter in mm_init (which follows the
memcpy when forking).

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29 21:40:38 -07:00
Linus Torvalds 1da177e4c3 Linux-2.6.12-rc2
Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.

Let it rip!
2005-04-16 15:20:36 -07:00