Commit graph

148 commits

Author SHA1 Message Date
Michal Hocko
19809c2da2 mm, vmalloc: use __GFP_HIGHMEM implicitly
__vmalloc* allows users to provide gfp flags for the underlying
allocation.  This API is quite popular

  $ git grep "=[[:space:]]__vmalloc\|return[[:space:]]*__vmalloc" | wc -l
  77

The only problem is that many people are not aware that they really want
to give __GFP_HIGHMEM along with other flags because there is really no
reason to consume precious lowmemory on CONFIG_HIGHMEM systems for pages
which are mapped to the kernel vmalloc space.  About half of users don't
use this flag, though.  This signals that we make the API unnecessarily
too complex.

This patch simply uses __GFP_HIGHMEM implicitly when allocating pages to
be mapped to the vmalloc space.  Current users which add __GFP_HIGHMEM
are simplified and drop the flag.

Link: http://lkml.kernel.org/r/20170307141020.29107-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Cristopher Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:13 -07:00
Ingo Molnar
3f07c01441 sched/headers: Prepare for new header dependencies before moving code to <linux/sched/signal.h>
We are going to split <linux/sched/signal.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/signal.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:29 +01:00
Alexey Dobriyan
9b80a184ea fs/file: more unsigned file descriptors
Propagate unsignedness for grand total of 149 bytes:

	$ ./scripts/bloat-o-meter ../vmlinux-000 ../obj/vmlinux
	add/remove: 0/0 grow/shrink: 0/10 up/down: 0/-149 (-149)
	function                                     old     new   delta
	set_close_on_exec                             99      98      -1
	put_files_struct                             201     200      -1
	get_close_on_exec                             59      58      -1
	do_prlimit                                   498     497      -1
	do_execveat_common.isra                     1662    1661      -1
	__close_fd                                   178     173      -5
	do_dup2                                      219     204     -15
	seq_show                                     685     660     -25
	__alloc_fd                                   384     357     -27
	dup_fd                                       718     646     -72

It mostly comes from converting "unsigned int" to "long" for bit operations.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-09-27 18:47:38 -04:00
Al Viro
63b6df1413 give readdir(2)/getdents(2)/etc. uniform exclusion with lseek()
same as read() on regular files has, and for the same reason.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:49:28 -04:00
Vladimir Davydov
5d097056c9 kmemcg: account certain kmem allocations to memcg
Mark those kmem allocations that are known to be easily triggered from
userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
memcg.  For the list, see below:

 - threadinfo
 - task_struct
 - task_delay_info
 - pid
 - cred
 - mm_struct
 - vm_area_struct and vm_region (nommu)
 - anon_vma and anon_vma_chain
 - signal_struct
 - sighand_struct
 - fs_struct
 - files_struct
 - fdtable and fdtable->full_fds_bits
 - dentry and external_name
 - inode for all filesystems. This is the most tedious part, because
   most filesystems overwrite the alloc_inode method.

The list is far from complete, so feel free to add more objects.
Nevertheless, it should be close to "account everything" approach and
keep most workloads within bounds.  Malevolent users will be able to
breach the limit, but this was possible even with the former "account
everything" approach (simply because it did not account everything in
fact).

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Rasmus Villemoes
752343be63 fs/file.c: __const_max is actually __const_min :-)
7f4b36f9bb "get rid of files_defer_init()" inexplicably changed a
min() to a __const_max() - but the __const_max macro actually gives
the minimum... So no functional change, just less confusing naming.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-12-06 21:17:11 -05:00
Eric Biggers
ea5c58e70c vfs: clear remainder of 'full_fds_bits' in dup_fd()
This fixes a bug from commit f3f86e33dc ("vfs: Fix pathological
performance case for __alloc_fd()").

v2: refactor to share fd bitmap copying code
Signed-off-by: Eric Biggers <ebiggers3@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05 23:05:32 -08:00
Linus Torvalds
fc90888d07 vfs: conditionally clear close-on-exec flag
We clear the close-on-exec flag when opening and closing files, and the
bit was almost always already clear before.  Avoid dirtying the
cacheline if the clearning isn't necessary.  That avoids unnecessary
cacheline dirtying and bouncing in multi-socket environments.

Eric Dumazet has a file descriptor benchmark that goes 4% faster from
this on his two-socket machine.  It's probably partly superlinear
improvement due to getting slightly less spinlock contention on the
file_lock spinlock due to less work in the critical section.

Tested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-10-31 16:14:51 -07:00
Linus Torvalds
f3f86e33dc vfs: Fix pathological performance case for __alloc_fd()
Al Viro points out that:
> >     * [Linux-specific aside] our __alloc_fd() can degrade quite badly
> > with some use patterns.  The cacheline pingpong in the bitmap is probably
> > inevitable, unless we accept considerably heavier memory footprint,
> > but we also have a case when alloc_fd() takes O(n) and it's _not_ hard
> > to trigger - close(3);open(...); will have the next open() after that
> > scanning the entire in-use bitmap.

And Eric Dumazet has a somewhat realistic multithreaded microbenchmark
that opens and closes a lot of sockets with minimal work per socket.

This patch largely fixes it.  We keep a 2nd-level bitmap of the open
file bitmaps, showing which words are already full.  So then we can
traverse that second-level bitmap to efficiently skip already allocated
file descriptors.

On his benchmark, this improves performance by up to an order of
magnitude, by avoiding the excessive open file bitmap scanning.

Tested-and-acked-by: Eric Dumazet <edumazet@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-10-31 16:12:10 -07:00
Eric Dumazet
5ba97d2832 fs/file.c: __fget() and dup2() atomicity rules
__fget() does lockless fetch of pointer from the descriptor
table, attempts to grab a reference and treats "it was already
zero" as "it's already gone from the table, we just hadn't
seen the store, let's fail".  Unfortunately, that breaks the
atomicity of dup2() - __fget() might see the old pointer,
notice that it's been already dropped and treat that as
"it's closed".  What we should be getting is either the
old file or new one, depending whether we come before or after
dup2().

Dmitry had following test failing sometimes :

int fd;
void *Thread(void *x) {
  char buf;
  int n = read(fd, &buf, 1);
  if (n != 1)
    exit(printf("read failed: n=%d errno=%d\n", n, errno));
  return 0;
}

int main()
{
  fd = open("/dev/urandom", O_RDONLY);
  int fd2 = open("/dev/urandom", O_RDONLY);
  if (fd == -1 || fd2 == -1)
    exit(printf("open failed\n"));
  pthread_t th;
  pthread_create(&th, 0, Thread, 0);
  if (dup2(fd2, fd) == -1)
    exit(printf("dup2 failed\n"));
  pthread_join(th, 0);
  if (close(fd) == -1)
    exit(printf("close failed\n"));
  if (close(fd2) == -1)
    exit(printf("close failed\n"));
  printf("DONE\n");
  return 0;
}

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-07-01 02:31:08 -04:00
Eric Dumazet
8a81252b77 fs/file.c: don't acquire files->file_lock in fd_install()
Mateusz Guzik reported :

 Currently obtaining a new file descriptor results in locking fdtable
 twice - once in order to reserve a slot and second time to fill it.

Holding the spinlock in __fd_install() is needed in case a resize is
done, or to prevent a resize.

Mateusz provided an RFC patch and a micro benchmark :
  http://people.redhat.com/~mguzik/pipebench.c

A resize is an unlikely operation in a process lifetime,
as table size is at least doubled at every resize.

We can use RCU instead of the spinlock.

__fd_install() must wait if a resize is in progress.

The resize must block new __fd_install() callers from starting,
and wait that ongoing install are finished (synchronize_sched())

resize should be attempted by a single thread to not waste resources.

rcu_sched variant is used, as __fd_install() and expand_fdtable() run
from process context.

It gives us a ~30% speedup using pipebench on a dual Intel(R) Xeon(R)
CPU E5-2696 v2 @ 2.50GHz

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Mateusz Guzik <mguzik@redhat.com>
Acked-by: Mateusz Guzik <mguzik@redhat.com>
Tested-by: Mateusz Guzik <mguzik@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-07-01 02:30:09 -04:00
Konstantin Khlebnikov
90f31d0ea8 mm: rcu-protected get_mm_exe_file()
This patch removes mm->mmap_sem from mm->exe_file read side.
Also it kills dup_mm_exe_file() and moves exe_file duplication into
dup_mmap() where both mmap_sems are locked.

[akpm@linux-foundation.org: fix comment typo]
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17 09:04:07 -04:00
Yann Droneaud
8d10a03582 fs/file.c: replace get_unused_fd() with get_unused_fd_flags(0)
This patch replaces calls to get_unused_fd() with equivalent call to
get_unused_fd_flags(0) to preserve current behavor for existing code.

In a further patch, get_unused_fd() will be removed so that new code
start using get_unused_fd_flags(), with the hope O_CLOEXEC could be
used, either by default or choosen by userspace.

Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:10 -08:00
Linus Torvalds
d6dd50e07c Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
 "The main changes in this cycle were:

   - changes related to No-CBs CPUs and NO_HZ_FULL

   - RCU-tasks implementation

   - torture-test updates

   - miscellaneous fixes

   - locktorture updates

   - RCU documentation updates"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (81 commits)
  workqueue: Use cond_resched_rcu_qs macro
  workqueue: Add quiescent state between work items
  locktorture: Cleanup header usage
  locktorture: Cannot hold read and write lock
  locktorture: Fix __acquire annotation for spinlock irq
  locktorture: Support rwlocks
  rcu: Eliminate deadlock between CPU hotplug and expedited grace periods
  locktorture: Document boot/module parameters
  rcutorture: Rename rcutorture_runnable parameter
  locktorture: Add test scenario for rwsem_lock
  locktorture: Add test scenario for mutex_lock
  locktorture: Make torture scripting account for new _runnable name
  locktorture: Introduce torture context
  locktorture: Support rwsems
  locktorture: Add infrastructure for torturing read locks
  torture: Address race in module cleanup
  locktorture: Make statistics generic
  locktorture: Teach about lock debugging
  locktorture: Support mutexes
  locktorture: Add documentation
  ...
2014-10-13 15:44:12 +02:00
Al Viro
e983094d6d missing annotation in fs/file.c
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-09 02:39:11 -04:00
Paul E. McKenney
bde6c3aa99 rcu: Provide cond_resched_rcu_qs() to force quiescent states in long loops
RCU-tasks requires the occasional voluntary context switch
from CPU-bound in-kernel tasks.  In some cases, this requires
instrumenting cond_resched().  However, there is some reluctance
to countenance unconditionally instrumenting cond_resched() (see
http://lwn.net/Articles/603252/), so this commit creates a separate
cond_resched_rcu_qs() that may be used in place of cond_resched() in
locations prone to long-duration in-kernel looping.

This commit currently instruments only RCU-tasks.  Future possibilities
include also instrumenting RCU, RCU-bh, and RCU-sched in order to reduce
IPI usage.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07 16:27:20 -07:00
Al Viro
f6c0a1920e fs/file.c: don't open-code kvfree()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:31:10 -04:00
Linus Torvalds
5166701b36 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
 "The first vfs pile, with deep apologies for being very late in this
  window.

  Assorted cleanups and fixes, plus a large preparatory part of iov_iter
  work.  There's a lot more of that, but it'll probably go into the next
  merge window - it *does* shape up nicely, removes a lot of
  boilerplate, gets rid of locking inconsistencie between aio_write and
  splice_write and I hope to get Kent's direct-io rewrite merged into
  the same queue, but some of the stuff after this point is having
  (mostly trivial) conflicts with the things already merged into
  mainline and with some I want more testing.

  This one passes LTP and xfstests without regressions, in addition to
  usual beating.  BTW, readahead02 in ltp syscalls testsuite has started
  giving failures since "mm/readahead.c: fix readahead failure for
  memoryless NUMA nodes and limit readahead pages" - might be a false
  positive, might be a real regression..."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
  missing bits of "splice: fix racy pipe->buffers uses"
  cifs: fix the race in cifs_writev()
  ceph_sync_{,direct_}write: fix an oops on ceph_osdc_new_request() failure
  kill generic_file_buffered_write()
  ocfs2_file_aio_write(): switch to generic_perform_write()
  ceph_aio_write(): switch to generic_perform_write()
  xfs_file_buffered_aio_write(): switch to generic_perform_write()
  export generic_perform_write(), start getting rid of generic_file_buffer_write()
  generic_file_direct_write(): get rid of ppos argument
  btrfs_file_aio_write(): get rid of ppos
  kill the 5th argument of generic_file_buffered_write()
  kill the 4th argument of __generic_file_aio_write()
  lustre: don't open-code kernel_recvmsg()
  ocfs2: don't open-code kernel_recvmsg()
  drbd: don't open-code kernel_recvmsg()
  constify blk_rq_map_user_iov() and friends
  lustre: switch to kernel_sendmsg()
  ocfs2: don't open-code kernel_sendmsg()
  take iov_iter stuff to mm/iov_iter.c
  process_vm_access: tidy up a bit
  ...
2014-04-12 14:49:50 -07:00
Al Viro
7f4b36f9bb get rid of files_defer_init()
the only thing it's doing these days is calculation of
upper limit for fs.nr_open sysctl and that can be done
statically

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01 23:19:14 -04:00
Linus Torvalds
b3fd4ea9df Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
 "Main changes:

   - Torture-test changes, including refactoring of rcutorture and
     introduction of a vestigial locktorture.

   - Real-time latency fixes.

   - Documentation updates.

   - Miscellaneous fixes"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (77 commits)
  rcu: Provide grace-period piggybacking API
  rcu: Ensure kernel/rcu/rcu.h can be sourced/used stand-alone
  rcu: Fix sparse warning for rcu_expedited from kernel/ksysfs.c
  notifier: Substitute rcu_access_pointer() for rcu_dereference_raw()
  Documentation/memory-barriers.txt: Clarify release/acquire ordering
  rcutorture: Save kvm.sh output to log
  rcutorture: Add a lock_busted to test the test
  rcutorture: Place kvm-test-1-run.sh output into res directory
  rcutorture: Rename TREE_RCU-Kconfig.txt
  locktorture: Add kvm-recheck.sh plug-in for locktorture
  rcutorture: Gracefully handle NULL cleanup hooks
  locktorture: Add vestigial locktorture configuration
  rcutorture: Introduce "rcu" directory level underneath configs
  rcutorture: Rename kvm-test-1-rcu.sh
  rcutorture: Remove RCU dependencies from ver_functions.sh API
  rcutorture: Create CFcommon file for common Kconfig parameters
  rcutorture: Create config files for scripted test-the-test testing
  rcutorture: Add an rcu_busted to test the test
  locktorture: Add a lock-torture kernel module
  rcutorture: Abstract kvm-recheck.sh
  ...
2014-03-31 11:05:24 -07:00
Eric Biggers
99aea68134 vfs: Don't let __fdget_pos() get FMODE_PATH files
Commit bd2a31d522 ("get rid of fget_light()") introduced the
__fdget_pos() function, which returns the resulting file pointer and
fdput flags combined in an 'unsigned long'.  However, it also changed the
behavior to return files with FMODE_PATH set, which shouldn't happen
because read(), write(), lseek(), etc. aren't allowed on such files.
This commit restores the old behavior.

This regression actually had no effect on read() and write() since
FMODE_READ and FMODE_WRITE are not set on file descriptors opened with
O_PATH, but it did cause lseek() on a file descriptor opened with O_PATH
to fail with ESPIPE rather than EBADF.

Signed-off-by: Eric Biggers <ebiggers3@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-03-23 00:03:12 -04:00
Al Viro
bd2a31d522 get rid of fget_light()
instead of returning the flags by reference, we can just have the
low-level primitive return those in lower bits of unsigned long,
with struct file * derived from the rest.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-03-10 11:44:42 -04:00
Paul E. McKenney
add1f09954 fs: Substitute rcu_access_pointer() for rcu_dereference_raw()
(Trivial patch.)

If the code is looking at the RCU-protected pointer itself, but not
dereferencing it, the rcu_dereference() functions can be downgraded to
rcu_access_pointer().  This commit makes this downgrade in __alloc_fd(),
which simply compares the RCU-protected pointer against NULL with no
dereferencing.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-02-17 15:02:21 -08:00
Eric W. Biederman
96c7a2ff21 fs/file.c:fdtable: avoid triggering OOMs from alloc_fdmem
Recently due to a spike in connections per second memcached on 3
separate boxes triggered the OOM killer from accept.  At the time the
OOM killer was triggered there was 4GB out of 36GB free in zone 1.  The
problem was that alloc_fdtable was allocating an order 3 page (32KiB) to
hold a bitmap, and there was sufficient fragmentation that the largest
page available was 8KiB.

I find the logic that PAGE_ALLOC_COSTLY_ORDER can't fail pretty dubious
but I do agree that order 3 allocations are very likely to succeed.

There are always pathologies where order > 0 allocations can fail when
there are copious amounts of free memory available.  Using the pigeon
hole principle it is easy to show that it requires 1 page more than 50%
of the pages being free to guarantee an order 1 (8KiB) allocation will
succeed, 1 page more than 75% of the pages being free to guarantee an
order 2 (16KiB) allocation will succeed and 1 page more than 87.5% of
the pages being free to guarantee an order 3 allocate will succeed.

A server churning memory with a lot of small requests and replies like
memcached is a common case that if anything can will skew the odds
against large pages being available.

Therefore let's not give external applications a practical way to kill
linux server applications, and specify __GFP_NORETRY to the kmalloc in
alloc_fdmem.  Unless I am misreading the code and by the time the code
reaches should_alloc_retry in __alloc_pages_slowpath (where
__GFP_NORETRY becomes signification).  We have already tried everything
reasonable to allocate a page and the only thing left to do is wait.  So
not waiting and falling back to vmalloc immediately seems like the
reasonable thing to do even if there wasn't a chance of triggering the
OOM killer.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Cong Wang <cwang@twopensource.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-02-10 16:01:41 -08:00
Oleg Nesterov
e6ff9a9fa4 fs: __fget_light() can use __fget() in slow path
The slow path in __fget_light() can use __fget() to avoid the
code duplication. Saves 232 bytes.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-25 03:14:38 -05:00
Oleg Nesterov
ad46183445 fs: factor out common code in fget_light() and fget_raw_light()
Apart from FMODE_PATH check fget_light() and fget_raw_light() are
identical, shift the code into the new helper, __fget_light(fd, mask).
Saves 208 bytes.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-25 03:14:37 -05:00
Oleg Nesterov
1deb46e256 fs: factor out common code in fget() and fget_raw()
Apart from FMODE_PATH check fget() and fget_raw() are identical,
shift the code into the new simple helper, __fget(fd, mask). Saves
160 bytes.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-25 03:14:37 -05:00
Oleg Nesterov
ce08b62d18 change close_files() to use rcu_dereference_raw(files->fdt)
put_files_struct() and close_files() do rcu_read_lock() to make
rcu_dereference_check_fdtable() happy.

This looks a bit ugly, files_fdtable() just reads the pointer,
we can simply use rcu_dereference_raw() to avoid the warning.

The patch also changes close_files() to return fdt, this avoids
another rcu_read_lock()/files_fdtable() in put_files_struct().

I think close_files() needs more cleanups:

	- we do not need xchg() exactly because we are the last
	  user of this files_struct

	- "if (file)" should be turned into WARN_ON(!file)

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-25 03:14:37 -05:00
Oleg Nesterov
a8d4b8345e introduce __fcheck_files() to fix rcu_dereference_check_fdtable(), kill rcu_my_thread_group_empty()
rcu_dereference_check_fdtable() looks very wrong,

1. rcu_my_thread_group_empty() was added by 844b9a8707 "vfs: fix
   RCU-lockdep false positive due to /proc" but it doesn't really
   fix the problem. A CLONE_THREAD (without CLONE_FILES) task can
   hit the same race with get_files_struct().

   And otoh rcu_my_thread_group_empty() can suppress the correct
   warning if the caller is the CLONE_FILES (without CLONE_THREAD)
   task.

2. files->count == 1 check is not really right too. Even if this
   files_struct is not shared it is not safe to access it lockless
   unless the caller is the owner.

   Otoh, this check is sub-optimal. files->count == 0 always means
   it is safe to use it lockless even if files != current->files,
   but put_files_struct() has to take rcu_read_lock(). See the next
   patch.

This patch removes the buggy checks and turns fcheck_files() into
__fcheck_files() which uses rcu_dereference_raw(), the "unshared"
callers, fget_light() and fget_raw_light(), can use it to avoid
the warning from RCU-lockdep.

fcheck_files() is trivially reimplemented as rcu_lockdep_assert()
plus __fcheck_files().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-25 03:14:36 -05:00
Al Viro
ac3e3c5b11 don't bother with deferred freeing of fdtables
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-01 17:31:42 -04:00
Thomas Gleixner
eece09ec21 locking: Various static lock initializer fixes
The static lock initializers want to be fed the proper name of the
lock and not some random string. In mainline random strings are
obfuscating the readability of debug output, but for RT they prevent
the spinlock substitution. Fix it up.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-02-19 08:42:45 +01:00
Greg Kroah-Hartman
6ae141718e misc: remove __dev* attributes.
CONFIG_HOTPLUG is going away as an option.  As a result, the __dev*
markings need to be removed.

This change removes the last of the __dev* markings from the kernel from
a variety of different, tiny, places.

Based on patches originally written by Bill Pemberton, but redone by me
in order to handle some of the coding style issues better, by hand.

Cc: Bill Pemberton <wfp5p@virginia.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-03 15:57:16 -08:00
Linus Torvalds
9977d9b379 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal
Pull big execve/kernel_thread/fork unification series from Al Viro:
 "All architectures are converted to new model.  Quite a bit of that
  stuff is actually shared with architecture trees; in such cases it's
  literally shared branch pulled by both, not a cherry-pick.

  A lot of ugliness and black magic is gone (-3KLoC total in this one):

   - kernel_thread()/kernel_execve()/sys_execve() redesign.

     We don't do syscalls from kernel anymore for either kernel_thread()
     or kernel_execve():

     kernel_thread() is essentially clone(2) with callback run before we
     return to userland, the callbacks either never return or do
     successful do_execve() before returning.

     kernel_execve() is a wrapper for do_execve() - it doesn't need to
     do transition to user mode anymore.

     As a result kernel_thread() and kernel_execve() are
     arch-independent now - they live in kernel/fork.c and fs/exec.c
     resp.  sys_execve() is also in fs/exec.c and it's completely
     architecture-independent.

   - daemonize() is gone, along with its parts in fs/*.c

   - struct pt_regs * is no longer passed to do_fork/copy_process/
     copy_thread/do_execve/search_binary_handler/->load_binary/do_coredump.

   - sys_fork()/sys_vfork()/sys_clone() unified; some architectures
     still need wrappers (ones with callee-saved registers not saved in
     pt_regs on syscall entry), but the main part of those suckers is in
     kernel/fork.c now."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal: (113 commits)
  do_coredump(): get rid of pt_regs argument
  print_fatal_signal(): get rid of pt_regs argument
  ptrace_signal(): get rid of unused arguments
  get rid of ptrace_signal_deliver() arguments
  new helper: signal_pt_regs()
  unify default ptrace_signal_deliver
  flagday: kill pt_regs argument of do_fork()
  death to idle_regs()
  don't pass regs to copy_process()
  flagday: don't pass regs to copy_thread()
  bfin: switch to generic vfork, get rid of pointless wrappers
  xtensa: switch to generic clone()
  openrisc: switch to use of generic fork and clone
  unicore32: switch to generic clone(2)
  score: switch to generic fork/vfork/clone
  c6x: sanitize copy_thread(), get rid of clone(2) wrapper, switch to generic clone()
  take sys_fork/sys_vfork/sys_clone prototypes to linux/syscalls.h
  mn10300: switch to generic fork/vfork/clone
  h8300: switch to generic fork/vfork/clone
  tile: switch to generic clone()
  ...

Conflicts:
	arch/microblaze/include/asm/Kbuild
2012-12-12 12:22:13 -08:00
Al Viro
a77cfcb429 fix off-by-one in argument passed by iterate_fd() to callbacks
Noticed by Pavel Roskin; the thing in his patch I disagree with
was compensating for that shite in callbacks instead of fixing
it once in the iterator itself.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-11-29 23:01:30 -05:00
Al Viro
c4144670fd kill daemonize()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-11-28 21:49:02 -05:00
Linus Torvalds
8d938105e4 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc VFS fixes from Al Viro:
 "Remove a bogus BUG_ON() that can trigger spuriously + alpha bits of
  do_mount() constification I'd missed during the merge window."

This pull request came in a week ago, I missed it for some reason.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  kill bogus BUG_ON() in do_close_on_exec()
  missing const in alpha callers of do_mount()
2012-11-18 09:13:48 -10:00
Al Viro
5a8477660d kill bogus BUG_ON() in do_close_on_exec()
It can be legitimately triggered via procfs access.  Now, at least
2 of 3 of get_files_struct() callers in procfs are useless, but
when and if we get rid of those we can always add WARN_ON() here.
BUG_ON() at that spot is simply wrong.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-11-12 01:19:02 -05:00
Al Viro
08f05c4974 Return the right error value when dup[23]() newfd argument is too large
Jack Lin reports that the error return from dup3() for the RLIMIT_NOFILE
case changed incorrectly after 3.6.

The culprit is commit f33ff9927f ("take rlimit check to callers of
expand_files()") which when it moved the "return -EMFILE" out to the
caller, didn't notice that the dup3() had special code to turn the
EMFILE return into EBADF.

The replace_fd() helper that got added later then inherited the bug too.

Reported-by: Jack Lin <linliangjie@huawei.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
[ Noted more bugs, wrote proper changelog, fixed up typos - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-30 21:27:28 -07:00
Richard W.M. Jones
aed976475b dup3: Return an error when oldfd == newfd.
I have tested the attached patch to fix the dup3 regression.

Rich.

From 0944e30e12dec6544b3602626b60ff412375c78f Mon Sep 17 00:00:00 2001
From: "Richard W.M. Jones" <rjones@redhat.com>
Date: Tue, 9 Oct 2012 14:42:45 +0100
Subject: [PATCH] dup3: Return an error when oldfd == newfd.

The following commit:

  commit fe17f22d7f
  Author: Al Viro <viro@zeniv.linux.org.uk>
  Date:   Tue Aug 21 11:48:11 2012 -0400

    take purely descriptor-related stuff from fcntl.c to file.c

was supposed to be just code motion, but it dropped the following two
lines:

  if (unlikely(oldfd == newfd))
          return -EINVAL;

from the dup3 system call.  dup3 is not specified by POSIX, so Linux
can do what it likes.  However the POSIX proposal for dup3 [1] states
that it should return an error if oldfd == newfd.

[1] http://austingroupbugs.net/view.php?id=411

Signed-off-by: Richard W.M. Jones <rjones@redhat.com>
Tested-by: Richard W.M. Jones <rjones@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-09 23:33:38 -04:00
Al Viro
4557c669ef export fget_light
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-26 21:10:06 -04:00
Al Viro
864bdb3b6c new helper: daemonize_descriptors()
descriptor-related parts of daemonize, done right.  As the
result we simplify the locking rules for ->files - we
hold task_lock in *all* cases when we modify ->files.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-26 21:10:00 -04:00
Al Viro
c3c073f808 new helper: iterate_fd()
iterates through the opened files in given descriptor table,
calling a supplied function; we stop once non-zero is returned.
Callback gets struct file *, descriptor number and const void *
argument passed to iterator.  It is called with files->file_lock
held, so it is not allowed to block.

tty_io, netprio_cgroup and selinux flush_unauthorized_files()
converted to its use.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-26 21:09:59 -04:00
Al Viro
ad47bd7252 make expand_files() and alloc_fd() static
no callers outside of fs/file.c left

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-26 21:09:58 -04:00
Al Viro
b8318b01a8 take __{set,clear}_{open_fd,close_on_exec}() into fs/file.c
nobody uses those outside anymore.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-26 21:09:58 -04:00
Al Viro
8280d16172 new helper: replace_fd()
analog of dup2(), except that it takes struct file * as source.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-26 21:09:57 -04:00
Al Viro
fe17f22d7f take purely descriptor-related stuff from fcntl.c to file.c
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-26 21:09:57 -04:00
Al Viro
6a6d27de34 take close-on-exec logics to fs/file.c, clean it up a bit
... and add cond_resched() there, while we are at it.  We can
get large latencies as is...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-26 21:09:56 -04:00
Al Viro
483ce1d4b8 take descriptor-related part of close() to file.c
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-26 21:08:56 -04:00
Al Viro
0ee8cdfe6a take fget() and friends to fs/file.c
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-26 21:08:56 -04:00
Al Viro
f869e8a7f7 expose a low-level variant of fd_install() for binder
Similar situation to that of __alloc_fd(); do not use unless you
really have to.  You should not touch any descriptor table other
than your own; it's a sure sign of a really bad API design.

As with __alloc_fd(), you *must* use a first-class reference to
struct files_struct; something obtained by get_files_struct(some task)
(let alone direct task->files) will not do.  It must be either
current->files, or obtained by get_files_struct(current) by the
owner of that sucker and given to you.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-26 21:08:55 -04:00
Al Viro
56007cae94 move put_unused_fd() and fd_install() to fs/file.c
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-26 21:08:55 -04:00
Al Viro
1983e781da trim free_fdtable_rcu()
embedded case isn't hit anymore

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-26 21:08:54 -04:00
Al Viro
b9e02af0ae don't bother with call_rcu() in put_files_struct()
At that point nobody can see us anyway; everything that
looks at files_fdtable(files) is separated from the
guts of put_files_struct(files) - either since files is
current->files or because we fetched it under task_lock()
and hadn't dropped that yet, or because we'd bumped
files->count while holding task_lock()...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-26 21:08:54 -04:00
Al Viro
7cf4dc3c8d move files_struct-related bits from kernel/exit.c to fs/file.c
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-26 21:08:54 -04:00
Al Viro
dcfadfa4ec new helper: __alloc_fd()
Essentially, alloc_fd() in a files_struct we own a reference to.
Most of the time wanting to use it is a sign of lousy API
design (such as android/binder).  It's *not* a general-purpose
interface; better that than open-coding its guts, but again,
playing with other process' descriptor table is a sign of bad
design.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-26 21:08:53 -04:00
Al Viro
f33ff9927f take rlimit check to callers of expand_files()
... except for one in android, where the check is different
and already done in caller.  No need to recalculate rlimit
many times in alloc_fd() either.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-26 21:08:53 -04:00
Al Viro
1a7bd2265f make get_unused_fd_flags() a function
... and get_unused_fd() a macro around it

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-26 21:08:50 -04:00
Linus Torvalds
a591afc01d Merge branch 'x86-x32-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x32 support for x86-64 from Ingo Molnar:
 "This tree introduces the X32 binary format and execution mode for x86:
  32-bit data space binaries using 64-bit instructions and 64-bit kernel
  syscalls.

  This allows applications whose working set fits into a 32 bits address
  space to make use of 64-bit instructions while using a 32-bit address
  space with shorter pointers, more compressed data structures, etc."

Fix up trivial context conflicts in arch/x86/{Kconfig,vdso/vma.c}

* 'x86-x32-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (71 commits)
  x32: Fix alignment fail in struct compat_siginfo
  x32: Fix stupid ia32/x32 inversion in the siginfo format
  x32: Add ptrace for x32
  x32: Switch to a 64-bit clock_t
  x32: Provide separate is_ia32_task() and is_x32_task() predicates
  x86, mtrr: Use explicit sizing and padding for the 64-bit ioctls
  x86/x32: Fix the binutils auto-detect
  x32: Warn and disable rather than error if binutils too old
  x32: Only clear TIF_X32 flag once
  x32: Make sure TS_COMPAT is cleared for x32 tasks
  fs: Remove missed ->fds_bits from cessation use of fd_set structs internally
  fs: Fix close_on_exec pointer in alloc_fdtable
  x32: Drop non-__vdso weak symbols from the x32 VDSO
  x32: Fix coding style violations in the x32 VDSO code
  x32: Add x32 VDSO support
  x32: Allow x32 to be configured
  x32: If configured, add x32 system calls to system call tables
  x32: Handle process creation
  x32: Signal-related system calls
  x86: Add #ifdef CONFIG_COMPAT to <asm/sys_ia32.h>
  ...
2012-03-29 18:12:23 -07:00
Paul Gortmaker
630d9c4727 fs: reduce the use of module.h wherever possible
For files only using THIS_MODULE and/or EXPORT_SYMBOL, map
them onto including export.h -- or if the file isn't even
using those, then just delete the include.  Fix up any implicit
include dependencies that were being masked by module.h along
the way.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2012-02-28 19:31:58 -05:00
Bobby Powers
f044db4cb4 fs: Fix close_on_exec pointer in alloc_fdtable
alloc_fdtable allocates space for the open_fds and close_on_exec
bitfields together, as 2 * nr / BITS_PER_BYTE.  close_on_exec needs to
point to open_fds + nr / BITS_PER_BYTE, not open_fds + nr /
BITS_PER_LONG, as introducted in 1fd36adc: Replace the fd_sets in
struct fdtable with an array of unsigned longs.

Signed-off-by: Bobby Powers <bobbypowers@gmail.com>
Link: http://lkml.kernel.org/r/1329888587-3087-1-git-send-email-bobbypowers@gmail.com
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2012-02-23 18:28:52 -08:00
David Howells
1fd36adcd9 Replace the fd_sets in struct fdtable with an array of unsigned longs
Replace the fd_sets in struct fdtable with an array of unsigned longs and then
use the standard non-atomic bit operations rather than the FD_* macros.

This:

 (1) Removes the abuses of struct fd_set:

     (a) Since we don't want to allocate a full fd_set the vast majority of the
     	 time, we actually, in effect, just allocate a just-big-enough array of
     	 unsigned longs and cast it to an fd_set type - so why bother with the
     	 fd_set at all?

     (b) Some places outside of the core fdtable handling code (such as
     	 SELinux) want to look inside the array of unsigned longs hidden inside
     	 the fd_set struct for more efficient iteration over the entire set.

 (2) Eliminates the use of FD_*() macros in the kernel completely.

 (3) Permits the __FD_*() macros to be deleted entirely where not exposed to
     userspace.

Signed-off-by: David Howells <dhowells@redhat.com>
Link: http://lkml.kernel.org/r/20120216174954.23314.48147.stgit@warthog.procyon.org.uk
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
2012-02-19 10:30:57 -08:00
David Howells
1dce27c5aa Wrap accesses to the fd_sets in struct fdtable
Wrap accesses to the fd_sets in struct fdtable (for recording open files and
close-on-exec flags) so that we can move away from using fd_sets since we
abuse the fd_set structs by not allocating the full-sized structure under
normal circumstances and by non-core code looking at the internals of the
fd_sets.

The first abuse means that use of FD_ZERO() on these fd_sets is not permitted,
since that cannot be told about their abnormal lengths.

This introduces six wrapper functions for setting, clearing and testing
close-on-exec flags and fd-is-open flags:

	void __set_close_on_exec(int fd, struct fdtable *fdt);
	void __clear_close_on_exec(int fd, struct fdtable *fdt);
	bool close_on_exec(int fd, const struct fdtable *fdt);
	void __set_open_fd(int fd, struct fdtable *fdt);
	void __clear_open_fd(int fd, struct fdtable *fdt);
	bool fd_is_open(int fd, const struct fdtable *fdt);

Note that I've prepended '__' to the names of the set/clear functions because
they require the caller to hold a lock to use them.

Note also that I haven't added wrappers for looking behind the scenes at the
the array.  Possibly that should exist too.

Signed-off-by: David Howells <dhowells@redhat.com>
Link: http://lkml.kernel.org/r/20120216174942.23314.1364.stgit@warthog.procyon.org.uk
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
2012-02-19 10:30:52 -08:00
Andrew Morton
6d4831c283 vfs: avoid large kmalloc()s for the fdtable
Azurit reports large increases in system time after 2.6.36 when running
Apache.  It was bisected down to a892e2d7dc ("vfs: use kmalloc()
to allocate fdmem if possible").

That patch caused the vfs to use kmalloc() for very large allocations and
this is causing excessive work (and presumably excessive reclaim) within
the page allocator.

Fix it by falling back to vmalloc() earlier - when the allocation attempt
would have been considered "costly" by reclaim.

Reported-by: azurIt <azurit@pobox.sk>
Tested-by: azurIt <azurit@pobox.sk>
Acked-by: Changli Gao <xiaosuo@gmail.com>
Cc: Americo Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-28 11:28:20 -07:00
Changli Gao
a892e2d7dc vfs: use kmalloc() to allocate fdmem if possible
Use kmalloc() to allocate fdmem if possible.

vmalloc() is used as a fallback solution for fdmem allocation.  A new
helper function __free_fdtable() is introduced to reduce the lines of
code.

A potential bug, vfree() a memory allocated by kmalloc(), is fixed.

[akpm@linux-foundation.org: use __GFP_NOWARN, uninline alloc_fdmem() and free_fdmem()]
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-11 08:59:02 -07:00
Paul E. McKenney
b97181f242 fs: remove all rcu head initializations, except on_stack initializations
Remove all rcu head inits. We don't care about the RCU head state before passing
it to call_rcu() anyway. Only leave the "on_stack" variants so debugobjects can
keep track of objects on stack.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andries Brouwer <aeb@cwi.nl>
2010-06-14 16:37:26 -07:00
Jiri Slaby
d554ed895d fs: use rlimit helpers
Make sure compiler won't do weird things with limits.  E.g.  fetching them
twice may return 2 different values after writable limits are implemented.

I.e.  either use rlimit helpers added in commit 3e10e716ab ("resource:
add helpers for fetching rlimits") or ACCESS_ONCE if not applicable.

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06 11:26:29 -08:00
Paul E. McKenney
7dc5215798 vfs: Apply lockdep-based checking to rcu_dereference() uses
Add lockdep-ified RCU primitives to alloc_fd(), files_fdtable()
and fcheck_files().

Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
LKML-Reference: <1266887105-1528-8-git-send-email-paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-25 10:34:48 +01:00
Alexey Dobriyan
d43c36dc6b headers: remove sched.h from interrupt.h
After m68k's task_thread_info() doesn't refer to current,
it's possible to remove sched.h from interrupt.h and not break m68k!
Many thanks to Heiko Carstens for allowing this.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-10-11 11:20:58 -07:00
Al Viro
1027abe882 [PATCH] merge locate_fd() and get_unused_fd()
New primitive: alloc_fd(start, flags).  get_unused_fd() and
get_unused_fd_flags() become wrappers on top of it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-08-01 11:25:23 -04:00
Al Viro
4e1e018ecc [PATCH] fix RLIM_NOFILE handling
* dup2() should return -EBADF on exceeded sysctl_nr_open
* dup() should *not* return -EINVAL even if you have rlimit set to 0;
  it should get -EMFILE instead.

Check for orig_start exceeding rlimit taken to sys_fcntl().
Failing expand_files() in dup{2,3}() now gets -EMFILE remapped to -EBADF.
Consequently, remaining checks for rlimit are taken to expand_files().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-07-26 20:53:45 -04:00
Al Viro
eceea0b3df [PATCH] avoid multiplication overflows and signedness issues for max_fds
Limit sysctl_nr_open - we don't want ->max_fds to exceed MAX_INT and
we don't want size calculation for ->fd[] to overflow.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-05-16 17:22:52 -04:00
Al Viro
adbecb128c [PATCH] dup_fd() part 4 - race fix
Parent _can_ be a clone task, contrary to the comment.  Moreover,
more files could be opened while we allocate a copy, in which case
we end up copying only part into new descriptor table.  Since what
we get _is_ affected by all changes in the old range, we can get
rather weird effects - e.g.
	dup2(0, 1024); close(0);
in parallel with fork() resulting in child that sees the effect of
close(), but not that of dup2() done just before that close().

What we need is to recalculate the open_count after having reacquired
->file_lock and if external fdtable we'd just allocated is too small for
it, free the sucker and redo allocation.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-05-16 17:22:46 -04:00
Al Viro
afbec7fff4 [PATCH] dup_fd() - part 3
merge alloc_files() into dup_fd(), leave setting newf->fdt until the end

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-05-16 17:22:39 -04:00
Al Viro
9dec3c4d30 [PATCH] dup_fd() part 2
use alloc_fdtable() instead of expand_files(), get rid of pointless
grabbing newf->file_lock, kill magic in copy_fdtable() that used to
be there only to skip copying when called from dup_fd().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-05-16 17:22:33 -04:00
Al Viro
02afc6267f [PATCH] dup_fd() fixes, part 1
Move the sucker to fs/file.c in preparation to the rest

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-05-16 17:22:26 -04:00
Al Viro
f52111b154 [PATCH] take init_files to fs/file.c
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-05-16 17:22:20 -04:00
Al Viro
5c598b3428 [PATCH] fix sysctl_nr_open bugs
* if luser with root sets it to something that is not a multiple of
  BITS_PER_LONG, the system is screwed.
* if it gets decreased at the wrong time, we can get expand_files()
  returning success and _not_ increasing the size of table as asked.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-05-01 13:08:57 -04:00
Al Viro
9f3acc3140 [PATCH] split linux/file.h
Initial splitoff of the low-level stuff; taken to fdtable.h

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-05-01 13:08:16 -04:00
Eric Dumazet
9cfe015aa4 get rid of NR_OPEN and introduce a sysctl_nr_open
NR_OPEN (historically set to 1024*1024) actually forbids processes to open
more than 1024*1024 handles.

Unfortunatly some production servers hit the not so 'ridiculously high
value' of 1024*1024 file descriptors per process.

Changing NR_OPEN is not considered safe because of vmalloc space potential
exhaust.

This patch introduces a new sysctl (/proc/sys/fs/nr_open) wich defaults to
1024*1024, so that admins can decide to change this limit if their workload
needs it.

[akpm@linux-foundation.org: export it for sparc64]
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-06 10:41:06 -08:00
Vadim Lobanov
01b2d93ca4 [PATCH] fdtable: Provide free_fdtable() wrapper
Christoph Hellwig has expressed concerns that the recent fdtable changes
expose the details of the RCU methodology used to release no-longer-used
fdtable structures to the rest of the kernel.  The trivial patch below
addresses these concerns by introducing the appropriate free_fdtable()
calls, which simply wrap the release RCU usage.  Since free_fdtable() is a
one-liner, it makes sense to promote it to an inline helper.

Signed-off-by: Vadim Lobanov <vlobanov@speakeasy.net>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-22 08:55:50 -08:00
Vadim Lobanov
5466b456ed [PATCH] fdtable: Implement new pagesize-based fdtable allocator
This patch provides an improved fdtable allocation scheme, useful for
expanding fdtable file descriptor entries.  The main focus is on the fdarray,
as its memory usage grows 128 times faster than that of an fdset.

The allocation algorithm sizes the fdarray in such a way that its memory usage
increases in easy page-sized chunks. The overall algorithm expands the allowed
size in powers of two, in order to amortize the cost of invoking vmalloc() for
larger allocation sizes. Namely, the following sizes for the fdarray are
considered, and the smallest that accommodates the requested fd count is
chosen:

    pagesize / 4
    pagesize / 2
    pagesize      <- memory allocator switch point
    pagesize * 2
    pagesize * 4
    ...etc...

Unlike the current implementation, this allocation scheme does not require a
loop to compute the optimal fdarray size, and can be done in efficient
straightline code.

Furthermore, since the fdarray overflows the pagesize boundary long before any
of the fdsets do, it makes sense to optimize run-time by allocating both
fdsets in a single swoop.  Even together, they will still be, by far, smaller
than the fdarray.  The fdtable->open_fds is now used as the anchor for the
fdset memory allocation.

Signed-off-by: Vadim Lobanov <vlobanov@speakeasy.net>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:57:22 -08:00
Vadim Lobanov
4fd45812cb [PATCH] fdtable: Remove the free_files field
An fdtable can either be embedded inside a files_struct or standalone (after
being expanded).  When an fdtable is being discarded after all RCU references
to it have expired, we must either free it directly, in the standalone case,
or free the files_struct it is contained within, in the embedded case.

Currently the free_files field controls this behavior, but we can get rid of
it entirely, as all the necessary information is already recorded.  We can
distinguish embedded and standalone fdtables using max_fds, and if it is
embedded we can divine the relevant files_struct using container_of().

Signed-off-by: Vadim Lobanov <vlobanov@speakeasy.net>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:57:22 -08:00
Vadim Lobanov
bbea9f6966 [PATCH] fdtable: Make fdarray and fdsets equal in size
Currently, each fdtable supports three dynamically-sized arrays of data: the
fdarray and two fdsets.  The code allows the number of fds supported by the
fdarray (fdtable->max_fds) to differ from the number of fds supported by each
of the fdsets (fdtable->max_fdset).

In practice, it is wasteful for these two sizes to differ: whenever we hit a
limit on the smaller-capacity structure, we will reallocate the entire fdtable
and all the dynamic arrays within it, so any delta in the memory used by the
larger-capacity structure will never be touched at all.

Rather than hogging this excess, we shouldn't even allocate it in the first
place, and keep the capacities of the fdarray and the fdsets equal.  This
patch removes fdtable->max_fdset.  As an added bonus, most of the supporting
code becomes simpler.

Signed-off-by: Vadim Lobanov <vlobanov@speakeasy.net>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:57:22 -08:00
Tejun Heo
593be07ae8 [PATCH] file: kill unnecessary timer in fdtable_defer
free_fdtable_rc() schedules timer to reschedule fddef->wq if
schedule_work() on it returns 0.  However, schedule_work() guarantees that
the target work is executed at least once after the scheduling regardless
of its return value.  0 return simply means that the work was already
pending and thus no further action was required.

Another problem is that it used contant '5' as @expires argument to
mod_timer().

Kill unnecessary fddef->timer.

Signed-off-by: Tejun Heo <htejun@gmail.com>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:32 -08:00
David Howells
65f27f3844 WorkStruct: Pass the work_struct pointer instead of context data
Pass the work_struct pointer to the work function rather than context data.
The work function can use container_of() to work out the data.

For the cases where the container of the work_struct may go away the moment the
pending bit is cleared, it is made possible to defer the release of the
structure by deferring the clearing of the pending bit.

To make this work, an extra flag is introduced into the management side of the
work_struct.  This governs auto-release of the structure upon execution.

Ordinarily, the work queue executor would release the work_struct for further
scheduling or deallocation by clearing the pending bit prior to jumping to the
work function.  This means that, unless the driver makes some guarantee itself
that the work_struct won't go away, the work function may not access anything
else in the work_struct or its container lest they be deallocated..  This is a
problem if the auxiliary data is taken away (as done by the last patch).

However, if the pending bit is *not* cleared before jumping to the work
function, then the work function *may* access the work_struct and its container
with no problems.  But then the work function must itself release the
work_struct by calling work_release().

In most cases, automatic release is fine, so this is the default.  Special
initiators exist for the non-auto-release case (ending in _NAR).


Signed-Off-By: David Howells <dhowells@redhat.com>
2006-11-22 14:55:48 +00:00
Andrew Morton
327dcaadc0 [PATCH] expand_fdtable(): remove pointless unlock+lock
This unlock/lock on a super-unlikely path isn't worth the kernel text.

Cc: Vadim Lobanov <vlobanov@speakeasy.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:25 -07:00
Vadim Lobanov
74d392aaab [PATCH] Clean up expand_fdtable() and expand_files()
Perform a code cleanup against the expand_fdtable() and expand_files()
functions inside fs/file.c.  It aims to make the flow of code within these
functions simpler and easier to understand, via added comments and modest
refactoring.

Signed-off-by: Vadim Lobanov <vlobanov@speakeasy.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:25 -07:00
Andrew Morton
8b0e330b77 [PATCH] alloc_fdtable() cleanup
free_fdset(NULL, ...) is legal.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:19 -07:00
Andrew Morton
a29b0b74e7 [PATCH] alloc_fdtable() expansion fix
We're supposed to go the next power of two if nfds==nr.

Of `nr', not of `nfsd'.

Spotted by Rene Scharfe <rene.scharfe@lsrfire.ath.cx>

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-12 12:52:55 -07:00
Kirill Korotaev
d579091b43 [PATCH] fix fdset leakage
When found, it is obvious.  nfds calculated when allocating fdsets is
rewritten by calculation of size of fdtable, and when we are unlucky, we
try to free fdsets of wrong size.

Found due to OpenVZ resource management (User Beancounters).

Signed-off-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: Kirill Korotaev <dev@openvz.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-12 12:52:54 -07:00
Andrew Morton
92eb7a2f28 [PATCH] fix weird logic in alloc_fdtable()
There's a fairly obvious infinite loop in there.

Also, use roundup_pow_of_two() rather than open-coding stuff.

Cc: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10 13:24:25 -07:00
KAMEZAWA Hiroyuki
0a94502277 [PATCH] for_each_possible_cpu: fixes for generic part
replaces for_each_cpu with for_each_possible_cpu().

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 09:16:05 -08:00
Eric Dumazet
0c9e63fd38 [PATCH] Shrinks sizeof(files_struct) and better layout
1) Reduce the size of (struct fdtable) to exactly 64 bytes on 32bits
   platforms, lowering kmalloc() allocated space by 50%.

2) Reduce the size of (files_struct), using a special 32 bits (or
   64bits) embedded_fd_set, instead of a 1024 bits fd_set for the
   close_on_exec_init and open_fds_init fields.  This save some ram (248
   bytes per task) as most tasks dont open more than 32 files.  D-Cache
   footprint for such tasks is also reduced to the minimum.

3) Reduce size of allocated fdset.  Currently two full pages are
   allocated, that is 32768 bits on x86 for example, and way too much.  The
   minimum is now L1_CACHE_BYTES.

UP and SMP should benefit from this patch, because most tasks will touch
only one cache line when open()/close() stdin/stdout/stderr (0/1/2),
(next_fd, close_on_exec_init, open_fds_init, fd_array[0 ..  2] being in the
same cache line)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23 07:38:09 -08:00
Eric Dumazet
88a2a4ac6b [PATCH] percpu data: only iterate over possible CPUs
percpu_data blindly allocates bootmem memory to store NR_CPUS instances of
cpudata, instead of allocating memory only for possible cpus.

As a preparation for changing that, we need to convert various 0 -> NR_CPUS
loops to use for_each_cpu().

(The above only applies to users of asm-generic/percpu.h.  powerpc has gone it
alone and is presently only allocating memory for present CPUs, so it's
currently corrupting memory).

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Jens Axboe <axboe@suse.de>
Cc: Anton Blanchard <anton@samba.org>
Acked-by: William Irwin <wli@holomorphy.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-05 11:06:51 -08:00
Dipankar Sarma
0b175a7e68 [PATCH] Fix the fdtable freeing in the case of vmalloced fdset/arrays
Noted by David Miller:

  "The bug is that free_fd_array() takes a "num" argument, but when
   calling it from __free_fdtable() we're instead passing in the size in
   bytes (ie.  "num * sizeof(struct file *)")."

Yes it is a bug. I think I messed it up while merging newer
changes with an older version where I was using size in bytes
to optimize.

Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-14 12:38:26 -07:00
Dipankar Sarma
ab2af1f500 [PATCH] files: files struct with RCU
Patch to eliminate struct files_struct.file_lock spinlock on the reader side
and use rcu refcounting rcuref_xxx api for the f_count refcounter.  The
updates to the fdtable are done by allocating a new fdtable structure and
setting files->fdt to point to the new structure.  The fdtable structure is
protected by RCU thereby allowing lock-free lookup.  For fd arrays/sets that
are vmalloced, we use keventd to free them since RCU callbacks can't sleep.  A
global list of fdtable to be freed is not scalable, so we use a per-cpu list.
If keventd is already handling the current cpu's work, we use a timer to defer
queueing of that work.

Since the last publication, this patch has been re-written to avoid using
explicit memory barriers and use rcu_assign_pointer(), rcu_dereference()
premitives instead.  This required that the fd information is kept in a
separate structure (fdtable) and updated atomically.

Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-09 13:57:55 -07:00
Dipankar Sarma
badf16621c [PATCH] files: break up files struct
In order for the RCU to work, the file table array, sets and their sizes must
be updated atomically.  Instead of ensuring this through too many memory
barriers, we put the arrays and their sizes in a separate structure.  This
patch takes the first step of putting the file table elements in a separate
structure fdtable that is embedded withing files_struct.  It also changes all
the users to refer to the file table using files_fdtable() macro.  Subsequent
applciation of RCU becomes easier after this.

Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com>
Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-09 13:57:55 -07:00
Linus Torvalds
1da177e4c3 Linux-2.6.12-rc2
Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.

Let it rip!
2005-04-16 15:20:36 -07:00