Commit graph

376638 commits

Author SHA1 Message Date
Tejun Heo
e2bd416f62 cgroup: fix deadlock on cgroup_mutex via drop_parsed_module_refcounts()
eb178d0633 ("cgroup: grab cgroup_mutex in
drop_parsed_module_refcounts()") made drop_parsed_module_refcounts()
grab cgroup_mutex to make lockdep assertion in for_each_subsys()
happy.  Unfortunately, cgroup_remount() calls the function while
holding cgroup_mutex in its failure path leading to the following
deadlock.

# mount -t cgroup -o remount,memory,blkio cgroup blkio

 cgroup: option changes via remount are deprecated (pid=525 comm=mount)

 =============================================
 [ INFO: possible recursive locking detected ]
 3.10.0-rc4-work+ #1 Not tainted
 ---------------------------------------------
 mount/525 is trying to acquire lock:
  (cgroup_mutex){+.+.+.}, at: [<ffffffff8110a3e1>] drop_parsed_module_refcounts+0x21/0xb0

 but task is already holding lock:
  (cgroup_mutex){+.+.+.}, at: [<ffffffff8110e4e1>] cgroup_remount+0x51/0x200

 other info that might help us debug this:
  Possible unsafe locking scenario:

	CPU0
	----
   lock(cgroup_mutex);
   lock(cgroup_mutex);

  *** DEADLOCK ***

  May be due to missing lock nesting notation

 4 locks held by mount/525:
  #0:  (&type->s_umount_key#30){+.+...}, at: [<ffffffff811e9a0d>] do_mount+0x2bd/0xa30
  #1:  (&sb->s_type->i_mutex_key#9){+.+.+.}, at: [<ffffffff8110e4d3>] cgroup_remount+0x43/0x200
  #2:  (cgroup_mutex){+.+.+.}, at: [<ffffffff8110e4e1>] cgroup_remount+0x51/0x200
  #3:  (cgroup_root_mutex){+.+.+.}, at: [<ffffffff8110e4ef>] cgroup_remount+0x5f/0x200

 stack backtrace:
 CPU: 2 PID: 525 Comm: mount Not tainted 3.10.0-rc4-work+ #1
 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
  ffffffff829651f0 ffff88000ec2fc28 ffffffff81c24bb1 ffff88000ec2fce8
  ffffffff810f420d 0000000000000006 0000000000000001 0000000000000056
  ffff8800153b4640 ffff880000000000 ffffffff81c2e468 ffff8800153b4640
 Call Trace:
  [<ffffffff81c24bb1>] dump_stack+0x19/0x1b
  [<ffffffff810f420d>] __lock_acquire+0x15dd/0x1e60
  [<ffffffff810f531c>] lock_acquire+0x9c/0x1f0
  [<ffffffff81c2a805>] mutex_lock_nested+0x65/0x410
  [<ffffffff8110a3e1>] drop_parsed_module_refcounts+0x21/0xb0
  [<ffffffff8110e63e>] cgroup_remount+0x1ae/0x200
  [<ffffffff811c9bb2>] do_remount_sb+0x82/0x190
  [<ffffffff811e9d41>] do_mount+0x5f1/0xa30
  [<ffffffff811ea203>] SyS_mount+0x83/0xc0
  [<ffffffff81c2fb82>] system_call_fastpath+0x16/0x1b

Fix it by moving the drop_parsed_module_refcounts() invocation outside
cgroup_mutex.

Signed-off-by: Tejun Heo <tj@kernel.org>
2013-06-27 19:37:23 -07:00
Tejun Heo
a4ea1cc906 cgroup: always use RCU accessors for protected accesses
kernel/cgroup.c still has places where a RCU pointer is set and
accessed directly without going through RCU_INIT_POINTER() or
rcu_dereference_protected().  They're all properly protected accesses
so nothing is broken but it leads to spurious sparse RCU address space
warnings.

Substitute direct accesses with RCU_INIT_POINTER() and
rcu_dereference_protected().  Note that %true is specified as the
extra condition for all derference updates.  This isn't ideal as all
it does is suppressing warning without actually policing
synchronization rules; however, most are scheduled to be removed
pretty soon along with css_id itself, so no reason to be more
elaborate.

Combined with the previous changes, this removes all RCU related
sparse warnings from cgroup.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Acked-by; Li Zefan <lizefan@huawei.com>
2013-06-26 10:48:48 -07:00
Tejun Heo
a8ad805cfd cgroup: fix RCU accesses around task->cgroups
There are several places in kernel/cgroup.c where task->cgroups is
accessed and modified without going through proper RCU accessors.
None is broken as they're all lock protected accesses; however, this
still triggers sparse RCU address space warnings.

* Consistently use task_css_set() for task->cgroups dereferencing.

* Use RCU_INIT_POINTER() to clear task->cgroups to &init_css_set on
  exit.

* Remove unnecessary rcu_dereference_raw() from cset->subsys[]
  dereference in cgroup_exit().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-26 10:48:38 -07:00
Tejun Heo
14611e51a5 cgroup: fix RCU accesses to task->cgroups
task->cgroups is a RCU pointer pointing to struct css_set.  A task
switches to a different css_set on cgroup migration but a css_set
doesn't change once created and its pointers to cgroup_subsys_states
aren't RCU protected.

task_subsys_state[_check]() is the macro to acquire css given a task
and subsys_id pair.  It RCU-dereferences task->cgroups->subsys[] not
task->cgroups, so the RCU pointer task->cgroups ends up being
dereferenced without read_barrier_depends() after it.  It's broken.

Fix it by introducing task_css_set[_check]() which does
RCU-dereference on task->cgroups.  task_subsys_state[_check]() is
reimplemented to directly dereference ->subsys[] of the css_set
returned from task_css_set[_check]().

This removes some of sparse RCU warnings in cgroup.

v2: Fixed unbalanced parenthsis and there's no need to use
    rcu_dereference_raw() when !CONFIG_PROVE_RCU.  Both spotted by Li.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: stable@vger.kernel.org
2013-06-26 10:42:46 -07:00
Tejun Heo
eb178d0633 cgroup: grab cgroup_mutex in drop_parsed_module_refcounts()
This isn't strictly necessary as all subsystems specified in
@subsys_mask are guaranteed to be pinned; however, it does spuriously
trigger lockdep warning.  Let's grab cgroup_mutex around it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-26 10:40:10 -07:00
Tejun Heo
1672d04070 cgroup: fix cgroupfs_root early destruction path
cgroupfs_root used to have ->actual_subsys_mask in addition to
->subsys_mask.  a8a648c4ac ("cgroup: remove
cgroup->actual_subsys_mask") removed it noting that the subsys_mask is
essentially temporary and doesn't belong in cgroupfs_root; however,
the patch made it impossible to tell whether a cgroupfs_root actually
has the subsystems bound or just have the bits set leading to the
following BUG when trying to mount with subsystems which are already
mounted elsewhere.

 kernel BUG at kernel/cgroup.c:1038!
 invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
 ...
 CPU: 1 PID: 7973 Comm: mount Tainted: G        W    3.10.0-rc7-next-20130625-sasha-00011-g1c1dc0e #1105
 task: ffff880fc0ae8000 ti: ffff880fc0b9a000 task.ti: ffff880fc0b9a000
 RIP: 0010:[<ffffffff81249b29>]  [<ffffffff81249b29>] rebind_subsystems+0x409/0x5f0
 ...
 Call Trace:
  [<ffffffff8124bd4f>] cgroup_kill_sb+0xff/0x210
  [<ffffffff813d21af>] deactivate_locked_super+0x4f/0x90
  [<ffffffff8124f3b3>] cgroup_mount+0x673/0x6e0
  [<ffffffff81257169>] cpuset_mount+0xd9/0x110
  [<ffffffff813d2580>] mount_fs+0xb0/0x2d0
  [<ffffffff81404afd>] vfs_kern_mount+0xbd/0x180
  [<ffffffff814070b5>] do_new_mount+0x145/0x2c0
  [<ffffffff814085d6>] do_mount+0x356/0x3c0
  [<ffffffff8140873d>] SyS_mount+0xfd/0x140
  [<ffffffff854eb600>] tracesys+0xdd/0xe2

We still want rebind_subsystems() to take added/removed masks, so
let's fix it by marking whether a cgroupfs_root has finished binding
or not.  Also, document what's going on around ->subsys_mask
initialization so that similar mistakes aren't repeated.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-26 10:39:46 -07:00
Tejun Heo
fc76df7061 cgroup: reserve ID 0 for dummy_root and 1 for unified hierarchy
Before 1a57423166 ("cgroup: make hierarchy_id use cyclic idr"),
hierarchy IDs were allocated from 0.  As the dummy hierarchy was
always the one first initialized, it got assigned 0 and all other
hierarchies from 1.  The patch accidentally changed the minimum
useable ID to 2.

Let's restore ID 0 for dummy_root and while at it reserve 1 for
unified hierarchy.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: stable@vger.kernel.org
2013-06-25 11:53:37 -07:00
Tejun Heo
30159ec7a9 cgroup: implement for_each_[builtin_]subsys()
There are quite a few places where all loaded [builtin] subsys are
iterated.  Implement for_each_[builtin_]subsys() and replace manual
iterations with those to simplify those places a bit.  The new
iterators automatically skip NULL subsystems.  This shouldn't cause
any functional difference.

Iteration loops which scan all subsystems and then skipping modular
ones explicitly are converted to use for_each_builtin_subsys().

While at it, reorder variable declarations and adjust whitespaces a
bit in the affected functions.

v2: Add lockdep_assert_held() in for_each_subsys() and add comments
    about synchronization as suggested by Li.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-25 11:53:37 -07:00
Tejun Heo
82fe9b0da0 cgroup: move init_css_set initialization inside cgroup_mutex
cgroup_init() was doing init_css_set initialization outside
cgroup_mutex, which is fine but we want to add lockdep annotation on
subsystem iterations and cgroup_init() will trigger it spuriously.
Move init_css_set initialization inside cgroup_mutex.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-25 11:53:37 -07:00
Tejun Heo
5549c49791 cgroup: s/for_each_subsys()/for_each_root_subsys()/
for_each_subsys() walks over subsystems attached to a hierarchy and
we're gonna add iterators which walk over all available subsystems.
Rename for_each_subsys() to for_each_root_subsys() so that it's more
appropriately named and for_each_subsys() can be used to iterate all
subsystems.

While at it, remove unnecessary underbar prefix from macro arguments,
put them inside parentheses, and adjust indentation for the two
for_each_*() macros.

This patch is purely cosmetic.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-24 15:21:48 -07:00
Tejun Heo
b326f9d0db cgroup: clean up find_css_set() and friends
find_css_set() passes uninitialized on-stack template[] array to
find_existing_css_set() which sets the entries for all subsystems.
Passing around an uninitialized array is a bit icky and we want to
introduce an iterator which only iterates loaded subsystems.  Let's
initialize it on definition.

While at it, also make the following cosmetic cleanups.

* Convert to proper /** comments.

* Reorder variable declarations.

* Replace comment on synchronization with lockdep_assert_held().

This patch doesn't make any functional differences.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-24 15:21:48 -07:00
Tejun Heo
a8a648c4ac cgroup: remove cgroup->actual_subsys_mask
cgroup curiously has two subsystem masks, ->subsys_mask and
->actual_subsys_mask.  The latter only exists because the new target
subsys_mask is passed into rebind_subsystems() via @root>subsys_mask.
rebind_subsystems() needs to know what the current mask is to decide
how to reach the target mask so ->actual_subsys_mask is used as the
temp location to remember the current state.

Adding a temporary field to a permanent data structure is rather silly
and can be misleading.  Update rebind_subsystems() to take @added_mask
and @removed_mask instead and remove @root->actual_subsys_mask.

This patch shouldn't introduce any behavior changes.

v2: Comment and description updated as suggested by Li.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-24 15:21:47 -07:00
Tejun Heo
9871bf9550 cgroup: prefix global variables with "cgroup_"
Global variable names in kernel/cgroup.c are asking for trouble -
subsys, roots, rootnode and so on.  Rename them to have "cgroup_"
prefix.

* s/subsys/cgroup_subsys/

* s/rootnode/cgroup_dummy_root/

* s/dummytop/cgroup_cummy_top/

* s/roots/cgroup_roots/

* s/root_count/cgroup_root_count/

This patch is purely cosmetic.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-24 15:21:47 -07:00
Tejun Heo
02c402d985 cgroup: convert CFTYPE_* flags to enums
Purely cosmetic.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-24 15:21:47 -07:00
Li Zefan
03c78cbebb cgroup: rename cont to cgrp
Cont is short for container. control group was named process container
at first, but then people found container already has a meaning in
linux kernel.

Clean up the leftover variable name @cont.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-06-19 01:22:50 -07:00
Tejun Heo
00356bd5f0 cgroup: clean up cgroup_serial_nr_cursor
cgroup_serial_nr_cursor was created atomic64_t because I thought it
was never gonna used for anything other than assigning unique numbers
to cgroups and didn't want to worry about synchronization; however,
now we're using it as an event-stamp to distinguish cgroups created
before and after certain point which assumes that it's protected by
cgroup_mutex.

Let's make it clear by making it a u64.  Also, rename it to
cgroup_serial_nr_next and make it point to the next nr to allocate so
that where it's pointing to is clear and more conventional.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
2013-06-18 11:17:32 -07:00
Li Zefan
e8c82d20a9 cgroup: convert cgroup_cft_commit() to use cgroup_for_each_descendant_pre()
We used root->allcg_list to iterate cgroup hierarchy because at that time
cgroup_for_each_descendant_pre() hasn't been invented.

tj: In cgroup_cfts_commit(), s/@serial_nr/@update_upto/, move the
    assignment right above releasing cgroup_mutex and explain what's
    going on there.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-06-18 11:14:22 -07:00
Li Zefan
794611a1df cgroup: make serial_nr_cursor available throughout cgroup.c
The next patch will use it to determine if a cgroup is newly created
while we're iterating the cgroup hierarchy.

tj: Rephrased the comment on top of cgroup_serial_nr_cursor.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-06-18 11:14:21 -07:00
Li Zefan
f57947d277 cgroup: fix memory leak in cgroup_rm_cftypes()
The memory allocated in cgroup_add_cftypes() should be freed. The
effect of this bug is we leak a bit memory everytime we unload
cfq-iosched module if blkio cgroup is enabled.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-06-18 09:04:30 -07:00
Li Zefan
1c8158eeae cgroup: fix umount vs cgroup_event_remove() race
commit 5db9a4d99b
 Author: Tejun Heo <tj@kernel.org>
 Date:   Sat Jul 7 16:08:18 2012 -0700

     cgroup: fix cgroup hierarchy umount race

This commit fixed a race caused by the dput() in css_dput_fn(), but
the dput() in cgroup_event_remove() can also lead to the same BUG().

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
2013-06-18 09:04:30 -07:00
Li Zefan
084457f284 cgroup: fix umount vs cgroup_cfts_commit() race
cgroup_cfts_commit() uses dget() to keep cgroup alive after cgroup_mutex
is dropped, but dget() won't prevent cgroupfs from being umounted. When
the race happens, vfs will see some dentries with non-zero refcnt while
umount is in process.

Keep running this:
  mount -t cgroup -o blkio xxx /cgroup
  umount /cgroup

And this:
  modprobe cfq-iosched
  rmmod cfs-iosched

After a while, the BUG() in shrink_dcache_for_umount_subtree() may
be triggered:

  BUG: Dentry xxx{i=0,n=blkio.yyy} still in use (1) [umount of cgroup cgroup]

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
2013-06-18 09:04:30 -07:00
Tejun Heo
6db8e85c5c cgroup: disallow rename(2) if sane_behavior
cgroup's rename(2) isn't a proper migration implementation - it can't
move the cgroup to a different parent in the hierarchy.  All it can do
is swapping the name string for that cgroup.  This isn't useful and
can mislead users to think that cgroup supports proper cgroup-level
migration.  Disallow rename(2) if sane_behavior.

v2: Fail with -EPERM instead of -EINVAL so that it matches the vfs
    return value when ->rename is not implemented as suggested by Li.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-18 08:14:23 -07:00
Tejun Heo
f63674fd0d cgroup: update sane_behavior documentation
f12dc02014 ("cgroup: mark "tasks" cgroup file as insane") and
cc5943a781 ("cgroup: mark "notify_on_release" and "release_agent"
cgroup files insane") forgot to update the changed behavior
documentation in cgroup.h.  Update it.

Signed-off-by: Tejun Heo <tj@kernel.org>
2013-06-13 20:02:26 -07:00
Tejun Heo
d3daf28da1 cgroup: use percpu refcnt for cgroup_subsys_states
A css (cgroup_subsys_state) is how each cgroup is represented to a
controller.  As such, it can be used in hot paths across the various
subsystems different controllers are associated with.

One of the common operations is reference counting, which up until now
has been implemented using a global atomic counter and can have
significant adverse impact on scalability.  For example, css refcnt
can be gotten and put multiple times by blkcg for each IO request.
For highops configurations which try to do as much per-cpu as
possible, the global frequent refcnting can be very expensive.

In general, given the various and hugely diverse paths css's end up
being used from, we need to make it cheap and highly scalable.  In its
usage, css refcnting isn't very different from module refcnting.

This patch converts css refcnting to use the recently added
percpu_ref.  css_get/tryget/put() directly maps to the matching
percpu_ref operations and the deactivation logic is no longer
necessary as percpu_ref already has refcnt killing.

The only complication is that as the refcnt is per-cpu,
percpu_ref_kill() in itself doesn't ensure that further tryget
operations will fail, which we need to guarantee before invoking
->css_offline()'s.  This is resolved collecting kill confirmation
using percpu_ref_kill_and_confirm() and initiating the offline phase
of destruction after all css refcnt's are confirmed to be seen as
killed on all CPUs.  The previous patches already splitted destruction
into two phases, so percpu_ref_kill_and_confirm() can be hooked up
easily.

This patch removes css_refcnt() which is used for rcu dereference
sanity check in css_id().  While we can add a percpu refcnt API to ask
the same question, css_id() itself is scheduled to be removed fairly
soon, so let's not bother with it.  Just drop the sanity check and use
rcu_dereference_raw() instead.

v2: - init_cgroup_css() was calling percpu_ref_init() without checking
      the return value.  This causes two problems - the obvious lack
      of error handling and percpu_ref_init() being called from
      cgroup_init_subsys() before the allocators are up, which
      triggers warnings but doesn't cause actual problems as the
      refcnt isn't used for roots anyway.  Fix both by moving
      percpu_ref_init() to cgroup_create().

    - The base references were put too early by
      percpu_ref_kill_and_confirm() and cgroup_offline_fn() put the
      refs one extra time.  This wasn't noticeable because css's go
      through another RCU grace period before being freed.  Update
      cgroup_destroy_locked() to grab an extra reference before
      killing the refcnts.  This problem was noticed by Kent.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Kent Overstreet <koverstreet@google.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: "Alasdair G. Kergon" <agk@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Glauber Costa <glommer@gmail.com>
2013-06-13 19:43:12 -07:00
Tejun Heo
2b0e53a7c8 Merge branch 'for-3.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu into for-3.11
This is to receive percpu_refcount which will replace atomic_t
reference count in cgroup_subsys_state.

Signed-off-by: Tejun Heo <tj@kernel.org>
2013-06-13 19:42:22 -07:00
Tejun Heo
ea15f8ccdb cgroup: split cgroup destruction into two steps
Split cgroup_destroy_locked() into two steps and put the latter half
into cgroup_offline_fn() which is executed from a work item.  The
latter half is responsible for offlining the css's, removing the
cgroup from internal lists, and propagating release notification to
the parent.  The separation is to allow using percpu refcnt for css.

Note that this allows for other cgroup operations to happen between
the first and second halves of destruction, including creating a new
cgroup with the same name.  As the target cgroup is marked DEAD in the
first half and cgroup internals don't care about the names of cgroups,
this should be fine.  A comment explaining this will be added by the
next patch which implements the actual percpu refcnting.

As RCU freeing is guaranteed to happen after the second step of
destruction, we can use the same work item for both.  This patch
renames cgroup->free_work to ->destroy_work and uses it for both
purposes.  INIT_WORK() is now performed right before queueing the work
item.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-13 19:27:42 -07:00
Tejun Heo
455050d23e cgroup: reorder the operations in cgroup_destroy_locked()
This patch reorders the operations in cgroup_destroy_locked() such
that the userland visible parts happen before css offlining and
removal from the ->sibling list.  This will be used to make css use
percpu refcnt.

While at it, split out CGRP_DEAD related comment from the refcnt
deactivation one and correct / clarify how different guarantees are
met.

While this patch changes the specific order of operations, it
shouldn't cause any noticeable behavior difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-13 19:27:41 -07:00
Tejun Heo
dbece3a0f1 percpu-refcount: implement percpu_tryget() along with percpu_ref_kill_and_confirm()
Implement percpu_tryget() which stops giving out references once the
percpu_ref is visible as killed.  Because the refcnt is per-cpu,
different CPUs will start to see a refcnt as killed at different
points in time and tryget() may continue to succeed on subset of cpus
for a while after percpu_ref_kill() returns.

For use cases where it's necessary to know when all CPUs start to see
the refcnt as dead, percpu_ref_kill_and_confirm() is added.  The new
function takes an extra argument @confirm_kill which is invoked when
the refcnt is guaranteed to be viewed as killed on all CPUs.

While this isn't the prettiest interface, it doesn't force synchronous
wait and is much safer than requiring the caller to do its own
call_rcu().

v2: Patch description rephrased to emphasize that tryget() may
    continue to succeed on some CPUs after kill() returns as suggested
    by Kent.

v3: Function comment in percpu_ref_kill_and_confirm() updated warning
    people to not depend on the implied RCU grace period from the
    confirm callback as it's an implementation detail.

Signed-off-by: Tejun Heo <tj@kernel.org>
Slightly-Grumpily-Acked-by: Kent Overstreet <koverstreet@google.com>
2013-06-13 19:23:53 -07:00
Tejun Heo
bc497bd33b percpu-refcount: implement percpu_ref_cancel_init()
Normally, percpu_ref_init() initializes and percpu_ref_kill()
initiates destruction which completes asynchronously.  The
asynchronous destruction can be problematic in init failure path where
the caller wants to destroy half-constructed object - distinguishing
half-constructed objects from the usual release method can be painful
for complex objects.

This patch implements percpu_ref_cancel_init() which synchronously
destroys the percpu_ref without invoking release.  To avoid
unintentional misuses, the function requires the ref to have finished
percpu_ref_init() but never used and triggers WARN otherwise.

v2: Explain the weird name and usage restriction in the function
    comment.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Kent Overstreet <koverstreet@google.com>
2013-06-13 11:08:27 -07:00
Tejun Heo
acac7883ee percpu-refcount: add __must_check to percpu_ref_init() and don't use ACCESS_ONCE() in percpu_ref_kill_rcu()
Two small changes.

* Unlike most init functions, percpu_ref_init() allocates memory and
  may fail.  Let's mark it with __must_check in case the caller
  forgets.

* percpu_ref_kill_rcu() is unnecessarily using ACCESS_ONCE() to
  dereference @ref->pcpu_count, which can be misleading.  The pointer
  is guaranteed to be valid and visible and can't change underneath
  the function.  Drop ACCESS_ONCE().

Signed-off-by: Tejun Heo <tj@kernel.org>
2013-06-13 11:08:26 -07:00
Tejun Heo
6f3d828f0f cgroup: remove cgroup->count and use
cgroup->count tracks the number of css_sets associated with the cgroup
and used only to verify that no css_set is associated when the cgroup
is being destroyed.  It's superflous as the destruction path can
simply check whether cgroup->cset_links is empty instead.

Drop cgroup->count and check ->cset_links directly from
cgroup_destroy_locked().

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-13 10:55:18 -07:00
Tejun Heo
ddd69148bd cgroup: drop unnecessary RCU dancing from __put_css_set()
__put_css_set() does RCU read access on @cgrp across dropping
@cgrp->count so that it can continue accessing @cgrp even if the count
reached zero and destruction of the cgroup commenced.  Given that both
sides - __css_put() and cgroup_destroy_locked() - are cold paths, this
is unnecessary.  Just making cgroup_destroy_locked() grab css_set_lock
while checking @cgrp->count is enough.

Remove the RCU read locking from __put_css_set() and make
cgroup_destroy_locked() read-lock css_set_lock when checking
@cgrp->count.  This will also allow removing @cgrp->count.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-13 10:55:18 -07:00
Tejun Heo
54766d4a1d cgroup: rename CGRP_REMOVED to CGRP_DEAD
We will add another flag indicating that the cgroup is in the process
of being killed.  REMOVING / REMOVED is more difficult to distinguish
and cgroup_is_removing()/cgroup_is_removed() are a bit awkward.  Also,
later percpu_ref usage will involve "kill"ing the refcnt.

 s/CGRP_REMOVED/CGRP_DEAD/
 s/cgroup_is_removed()/cgroup_is_dead()

This patch is purely cosmetic.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-13 10:55:18 -07:00
Tejun Heo
5de0107e63 cgroup: clean up css_[try]get() and css_put()
* __css_get() isn't used by anyone.  Fold it into css_get().

* Add proper function comments to all css reference functions.

This patch is purely cosmetic.

v2: Typo fix as per Li.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-13 10:55:17 -07:00
Tejun Heo
f4f4be2bd2 cgroup: use kzalloc() instead of kmalloc()
There's no point in using kmalloc() instead of the clearing variant
for trivial stuff.  We can live dangerously elsewhere.  Use kzalloc()
instead and drop 0 inits.

While at it, do trivial code reorganization in cgroup_file_open().

This patch doesn't introduce any functional changes.

v2: I was caught in the very distant past where list_del() didn't
    poison and the initial version converted list_del()s to
    list_del_init()s too.  Li and Kent took me out of the stasis
    chamber.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kent Overstreet <koverstreet@google.com>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-13 10:55:17 -07:00
Tejun Heo
69d0206c79 cgroup: bring some sanity to naming around cg_cgroup_link
cgroups and css_sets are mapped M:N and this M:N mapping is
represented by struct cg_cgroup_link which forms linked lists on both
sides.  The naming around this mapping is already confusing and struct
cg_cgroup_link exacerbates the situation quite a bit.

>From cgroup side, it starts off ->css_sets and runs through
->cgrp_link_list.  From css_set side, it starts off ->cg_links and
runs through ->cg_link_list.  This is rather reversed as
cgrp_link_list is used to iterate css_sets and cg_link_list cgroups.
Also, this is the only place which is still using the confusing "cg"
for css_sets.  This patch cleans it up a bit.

* s/cgroup->css_sets/cgroup->cset_links/
  s/css_set->cg_links/css_set->cgrp_links/
  s/cgroup_iter->cg_link/cgroup_iter->cset_link/

* s/cg_cgroup_link/cgrp_cset_link/

* s/cgrp_cset_link->cg/cgrp_cset_link->cset/
  s/cgrp_cset_link->cgrp_link_list/cgrp_cset_link->cset_link/
  s/cgrp_cset_link->cg_link_list/cgrp_cset_link->cgrp_link/

* s/init_css_set_link/init_cgrp_cset_link/
  s/free_cg_links/free_cgrp_cset_links/
  s/allocate_cg_links/allocate_cgrp_cset_links/

* s/cgl[12]/link[12]/ in compare_css_sets()

* s/saved_link/tmp_link/ s/tmp/tmp_links/ and a couple similar
  adustments.

* Comment and whiteline adjustments.

After the changes, we have

	list_for_each_entry(link, &cont->cset_links, cset_link) {
		struct css_set *cset = link->cset;

instead of

	list_for_each_entry(link, &cont->css_sets, cgrp_link_list) {
		struct css_set *cset = link->cg;

This patch is purely cosmetic.

v2: Fix broken sentences in the patch description.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-13 10:55:17 -07:00
Tejun Heo
5abb885573 cgroup: consistently use @cset for struct css_set variables
cgroup.c uses @cg for most struct css_set variables, which in itself
could be a bit confusing, but made much worse by the fact that there
are places which use @cg for struct cgroup variables.
compare_css_sets() epitomizes this confusion - @[old_]cg are struct
css_set while @cg[12] are struct cgroup.

It's not like the whole deal with cgroup, css_set and cg_cgroup_link
isn't already confusing enough.  Let's give it some sanity by
uniformly using @cset for all struct css_set variables.

* s/cg/cset/ for all css_set variables.

* s/oldcg/old_cset/ s/oldcgrp/old_cgrp/.  The same for the ones
  prefixed with "new".

* s/cg/cgrp/ for cgroup variables in compare_css_sets().

* s/css/cset/ for the cgroup variable in task_cgroup_from_root().

* Whiteline adjustments.

This patch is purely cosmetic.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-13 10:55:17 -07:00
Tejun Heo
3fc3db9a3a cgroup: remove now unused css_depth()
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-13 10:55:17 -07:00
Tejun Heo
ac899061a9 percpu-refcount: cosmetic updates
* s/percpu_ref_release/percpu_ref_func_t/ as it's customary to have _t
  postfix for types and the type is gonna be used for a different type
  of callback too.

* Add @ARG to function comments.

* Drop unnecessary and unaligned indentation from percpu_ref_init()
  function comment.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Kent Overstreet <koverstreet@google.com>
2013-06-12 20:43:06 -07:00
Tejun Heo
6a24474da8 percpu-refcount: consistently use plain (non-sched) RCU
percpu_ref_get/put() are using preempt_disable/enable() while
percpu_ref_kill() is using plain call_rcu() instead of
call_rcu_sched().  This is buggy as grace periods of the two may not
match.  Fix it by using plain RCU in percpu_ref_get/put().

(I suggested using sched RCU in the first place but there's no actual
 benefit in doing so unless we're gonna introduce different variants
 of get/put to be called while preemption is alredy disabled, which we
 definitely shouldn't.)

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Kent Overstreet <koverstreet@google.com>
2013-06-12 20:43:06 -07:00
Tejun Heo
d5c56ced77 cgroup: clean up the cftype array for the base cgroup files
* Rename it from files[] (really?) to cgroup_base_files[].

* Drop CGROUP_FILE_GENERIC_PREFIX which was defined as "cgroup." and
  used inconsistently.  Just use "cgroup." directly.

* Collect insane files at the end.  Note that only the insane ones are
  missing "cgroup." prefix.

This patch doesn't introduce any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-05 12:00:33 -07:00
Tejun Heo
cc5943a781 cgroup: mark "notify_on_release" and "release_agent" cgroup files insane
The empty cgroup notification mechanism currently implemented in
cgroup is tragically outdated.  Forking and execing userland process
stopped being a viable notification mechanism more than a decade ago.
We're gonna have a saner mechanism.  Let's make it clear that this
abomination is going away.

Mark "notify_on_release" and "release_agent" with CFTYPE_INSANE.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-06-05 12:00:33 -07:00
Tejun Heo
f12dc02014 cgroup: mark "tasks" cgroup file as insane
Some resources controlled by cgroup aren't per-task and cgroup core
allowing threads of a single thread_group to be in different cgroups
forced memcg do explicitly find the group leader and use it.  This is
gonna be nasty when transitioning to unified hierarchy and in general
we don't want and won't support granularity finer than processes.

Mark "tasks" with CFTYPE_INSANE.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: cgroups@vger.kernel.org
Cc: Vivek Goyal <vgoyal@redhat.com>
2013-06-05 12:00:33 -07:00
Kent Overstreet
c1ae6e9b4d percpu-refcount: Don't use silly cmpxchg()
The cmpxchg() was just to ensure the debug check didn't race, which was
a bit excessive. The caller is supposed to do the appropriate
synchronization, which means percpu_ref_kill() can just do a simple
store.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-06-03 16:04:04 -07:00
Kent Overstreet
215e262f2a percpu: implement generic percpu refcounting
This implements a refcount with similar semantics to
atomic_get()/atomic_dec_and_test() - but percpu.

It also implements two stage shutdown, as we need it to tear down the
percpu counts.  Before dropping the initial refcount, you must call
percpu_ref_kill(); this puts the refcount in "shutting down mode" and
switches back to a single atomic refcount with the appropriate
barriers (synchronize_rcu()).

It's also legal to call percpu_ref_kill() multiple times - it only
returns true once, so callers don't have to reimplement shutdown
synchronization.

[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: coding-style tweak]
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-06-03 15:36:41 -07:00
Linus Torvalds
042dd60ca6 regulator: Fixes for v3.10
A few small fixes for v3.10, documentation things in the core and a few
 driver bugs.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJRrLnIAAoJELSic+t+oim99hQP/j3SRcv5hgUorXjK3X06Lw3o
 QhQh9uH1YvEASJcqt+eXCskHV6LE+lutb+glyOu/nS0BaNRs5KJjKMMLMyKLYOlp
 P3Y3jgn3K2Joq+o6b8u7cSrTV1ps2spKvHxoEeam5JTJF8QslNW1FUJF9C2qdDYx
 NHmKCNhz56B3yhf2N3nZuM9P85xPUH8pj48a1+fQiC0EbYyFX0LP3A20D9uH6PAl
 WGqN3BXctqtelBMkMfCRHRF3PqqBMSpeZ9gUAG+ckg7rn2ctToeB6rCh7/nGJfs1
 Jc0yzN3UtPfBZoALWqtXP/tRXnrtrK4qFAUTmNLt/TAYk3FiPzn5YC8mOCWduA+K
 awqRtZVOhpQ3kigVmHgKyvl7b/rZuTDaFAMH9PtGvltRoKo9eRIIqbZTfgAvgnOh
 S8FL7DdikMQxjz9E+eF7Rzngo79qMswq+RWA7ACemtxJD5QR9etITQZdAyuPopwL
 jnDE4ICjqnNywWibo330nmrWWJoOFmRpKCoaKvwFPflUmjnQ3TrvdhIgybDytQAh
 LnNUK4vi9tjKmaIl1ZL/LUF0J7rDwCfgQWzrRpRv4yfXDfl1UD57dwqRwJwvpYfx
 1UGxDHztLw2/Q53NEHVQxRIqIo/hrlg22JGkBIvGqBxPZKorKmIaOjhtcBqh6mfE
 eTluTNnxmWLhsk3fCMRW
 =yE/I
 -----END PGP SIGNATURE-----

Merge tag 'regulator-v3.10-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator

Pull regulator fixes from Mark Brown:
 "A few small fixes for v3.10, documentation things in the core and a
  few driver bugs."

* tag 'regulator-v3.10-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
  regulator: palmas: Fix "enable_reg" to point to the correct reg for SMPS10
  regulator: palmas: Fix incorrect condition
  regulator: core: Correct spelling mistake in comment
  regulator: dbx500: Make local symbol static
  regulator: Fix kernel-doc generation warnings.
2013-06-04 06:34:51 +09:00
Linus Torvalds
0ab60871b4 A couple jfs bug fixes for 3.10-rc5
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.20 (GNU/Linux)
 
 iQIcBAABAgAGBQJRrLmSAAoJEDaohF61QIxkakoP/jo8fo2KaixVHRag1h7pBwVG
 7UJrgYmojoFUsmem+k34rXOzwc8eqv5BzTwGJe81ktatUjPkEQELdwryB4Rw6YYI
 F8/Mrc2AjHEavuTnojmENP22+YkPOW90r+k00dVk+cY9FQsc1/AhNYtzpVg9hSRX
 RBhGcgfnQQCAlIf09vr7R19oBLiHfrFNnUYJeUg/d0bxW8uiTimX7ENDTOtn61WF
 hLEABnL2tC0UtK3z/B0qNIpDE5DHHaLo8WcGjlXE3Pte4JA2rPDJwaHaUgvCEo6R
 iYCw5Dq6bL7Ce7wMP1GzY+LHgaeUg5Nknnkl5vEKZSZuy7dfC2eBlYbziqyJm73A
 0e7UgypkaNGa4p4MzM0J5OW+4bRRPzNir99BnpU8Bm6dJTY8ZeIMTTQFX2Fla8+G
 SCo6dMjEUfF+yIykDwmeZYjHl5XeKM6B3imoJ8+yD9vZLzg6MY4SUmgPEKL7BFoi
 U4aiID6N67LYAMb5vapzpcl+uHZbrw7CBnI2FanqdsAoPjath1CgrEvseBAUowpy
 EUeCLaQ05xR8ynm44tQCPQFgqBpRxAiVtpy9nPulA3cQCbwelEuLKrZB17DBfK9i
 w7f5e8nNG8gXfRuL5idGzS7g6Hp90JUhBWKh0dJC820OM7idWnXGCLbYTm7t6pvo
 m931/xUofSsXCLNwEo88
 =JZIK
 -----END PGP SIGNATURE-----

Merge tag 'jfs-3.10-rc5' of git://github.com/kleikamp/linux-shaggy

Pull jfs bugfixes from David Kleikamp:
 "A couple jfs bug fixes for 3.10-rc5"

* tag 'jfs-3.10-rc5' of git://github.com/kleikamp/linux-shaggy:
  fs/jfs: Add check if journaling to disk has been disabled in lbmRead()
  jfs: Several bugs in jfs_freeze() and jfs_unfreeze()
2013-06-04 06:33:44 +09:00
Linus Torvalds
aa4f608478 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k
Pull m68k fix from Geert Uytterhoeven:
 "A boot lock-up on Mac, also destined for stable"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k:
  m68k/mac: Fix unexpected interrupt with CONFIG_EARLY_PRINTK
2013-06-03 18:09:42 +09:00
Linus Torvalds
286e050bc0 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 fixes from Martin Schwidefsky:
 "Recent bug fixes, one of them touches a common code file.

  It adds two #ifndef/#endif pairs to asm-generic/io.h to be able to
  override xlate_dev_kmem_ptr and xlate_dev_mem_ptr."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
  s390/pgtable: Fix gmap notifier address
  s390/dasd: fix handling of gone paths
  s390/pgtable: Fix check for pgste/storage key handling
  arch: s390: appldata: using strncpy() and strnlen() instead of sprintf()
  s390/smp: lost IPIs on cpu hotplug
  kernel: Fix s390 absolute memory access for /dev/mem
  s390/dma: do not call debug_dma after free
2013-06-03 18:04:07 +09:00
Linus Torvalds
7d80fea426 Merge branch 'for-3.10-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:

 - Fix for yet another xattr bug which may lead to NULL deref.

 - A subtle bug in for_each_descendant_pre().  This bug requires quite
   specific conditions to trigger and isn't too likely to actually
   happen in the wild, but maybe that just makes it that much more
   nastier.

 - A warning message added for silly cgroup re-mount (not -o remount,
   but unmount followed by mount) behavior.

* 'for-3.10-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: warn about mismatching options of a new mount of an existing hierarchy
  cgroup: fix a subtle bug in descendant pre-order walk
  cgroup: initialize xattr before calling d_instantiate()
2013-06-03 17:57:16 +09:00