Commit graph

376 commits

Author SHA1 Message Date
Eric Biggers
22d94f493b fscrypt: add FS_IOC_ADD_ENCRYPTION_KEY ioctl
Add a new fscrypt ioctl, FS_IOC_ADD_ENCRYPTION_KEY.  This ioctl adds an
encryption key to the filesystem's fscrypt keyring ->s_master_keys,
making any files encrypted with that key appear "unlocked".

Why we need this
~~~~~~~~~~~~~~~~

The main problem is that the "locked/unlocked" (ciphertext/plaintext)
status of encrypted files is global, but the fscrypt keys are not.
fscrypt only looks for keys in the keyring(s) the process accessing the
filesystem is subscribed to: the thread keyring, process keyring, and
session keyring, where the session keyring may contain the user keyring.

Therefore, userspace has to put fscrypt keys in the keyrings for
individual users or sessions.  But this means that when a process with a
different keyring tries to access encrypted files, whether they appear
"unlocked" or not is nondeterministic.  This is because it depends on
whether the files are currently present in the inode cache.

Fixing this by consistently providing each process its own view of the
filesystem depending on whether it has the key or not isn't feasible due
to how the VFS caches work.  Furthermore, while sometimes users expect
this behavior, it is misguided for two reasons.  First, it would be an
OS-level access control mechanism largely redundant with existing access
control mechanisms such as UNIX file permissions, ACLs, LSMs, etc.
Encryption is actually for protecting the data at rest.

Second, almost all users of fscrypt actually do need the keys to be
global.  The largest users of fscrypt, Android and Chromium OS, achieve
this by having PID 1 create a "session keyring" that is inherited by
every process.  This works, but it isn't scalable because it prevents
session keyrings from being used for any other purpose.

On general-purpose Linux distros, the 'fscrypt' userspace tool [1] can't
similarly abuse the session keyring, so to make 'sudo' work on all
systems it has to link all the user keyrings into root's user keyring
[2].  This is ugly and raises security concerns.  Moreover it can't make
the keys available to system services, such as sshd trying to access the
user's '~/.ssh' directory (see [3], [4]) or NetworkManager trying to
read certificates from the user's home directory (see [5]); or to Docker
containers (see [6], [7]).

By having an API to add a key to the *filesystem* we'll be able to fix
the above bugs, remove userspace workarounds, and clearly express the
intended semantics: the locked/unlocked status of an encrypted directory
is global, and encryption is orthogonal to OS-level access control.

Why not use the add_key() syscall
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We use an ioctl for this API rather than the existing add_key() system
call because the ioctl gives us the flexibility needed to implement
fscrypt-specific semantics that will be introduced in later patches:

- Supporting key removal with the semantics such that the secret is
  removed immediately and any unused inodes using the key are evicted;
  also, the eviction of any in-use inodes can be retried.

- Calculating a key-dependent cryptographic identifier and returning it
  to userspace.

- Allowing keys to be added and removed by non-root users, but only keys
  for v2 encryption policies; and to prevent denial-of-service attacks,
  users can only remove keys they themselves have added, and a key is
  only really removed after all users who added it have removed it.

Trying to shoehorn these semantics into the keyrings syscalls would be
very difficult, whereas the ioctls make things much easier.

However, to reuse code the implementation still uses the keyrings
service internally.  Thus we get lockless RCU-mode key lookups without
having to re-implement it, and the keys automatically show up in
/proc/keys for debugging purposes.

References:

    [1] https://github.com/google/fscrypt
    [2] https://goo.gl/55cCrI#heading=h.vf09isp98isb
    [3] https://github.com/google/fscrypt/issues/111#issuecomment-444347939
    [4] https://github.com/google/fscrypt/issues/116
    [5] https://bugs.launchpad.net/ubuntu/+source/fscrypt/+bug/1770715
    [6] https://github.com/google/fscrypt/issues/128
    [7] https://askubuntu.com/questions/1130306/cannot-run-docker-on-an-encrypted-filesystem

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-08-12 19:06:13 -07:00
Al Viro
c2c44ec20a Unbreak mount_capable()
In "consolidate the capability checks in sget_{fc,userns}())" the
wrong argument had been passed to mount_capable() by sget_fc().
That mistake had been further obscured later, when switching
mount_capable() to fs_context has moved the calculation of
bogus argument from sget_fc() to mount_capable() itself.  It
should've been fc->user_ns all along.

Screwed-up-by: Al Viro <viro@zeniv.linux.org.uk>
Reported-by: Christian Brauner <christian@brauner.io>
Tested-by: Christian Brauner <christian@brauner.io>
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-07-31 12:22:32 -04:00
Al Viro
c23a0bbab3 convenience helper: get_tree_single()
counterpart of mount_single(); switch fusectl to it

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-07-04 22:01:58 -04:00
Al Viro
2ac295d4f0 convenience helper get_tree_nodev()
counterpart of mount_nodev().  Switch hugetlb and pseudo to it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-07-04 22:01:58 -04:00
David Howells
023d066a0d vfs: Kill sget_userns()
Kill sget_userns(), folding it into sget() as that's the only remaining
user.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-fsdevel@vger.kernel.org
2019-05-25 18:06:17 -04:00
David Howells
c80fa7c830 vfs: Provide sb->s_iflags settings in fs_context struct
Provide a field in the fs_context struct through which bits in the
sb->s_iflags superblock field can be set.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-fsdevel@vger.kernel.org
2019-05-25 18:00:03 -04:00
Al Viro
c3aabf0780 move mount_capable() further out
Call graph of vfs_get_tree():
	vfs_fsconfig_locked()	# neither kernmount, nor submount
	do_new_mount()		# neither kernmount, nor submount
	fc_mount()
		afs_mntpt_do_automount()	# submount
		mount_one_hugetlbfs()		# kernmount
		pid_ns_prepare_proc()		# kernmount
		mq_create_mount()		# kernmount
		vfs_kern_mount()
			simple_pin_fs()		# kernmount
			vfs_submount()		# submount
			kern_mount()		# kernmount
			init_mount_tree()
			btrfs_mount()
			nfs_do_root_mount()

	The first two need the check (unconditionally).
init_mount_tree() is setting rootfs up; any capability
checks make zero sense for that one.  And btrfs_mount()/
nfs_do_root_mount() have the checks already done in their
callers.

	IOW, we can shift mount_capable() handling into
the two callers - one in the normal case of mount(2),
another - in fsconfig(2) handling of FSCONFIG_CMD_CREATE.
I.e. the syscalls that set a new filesystem up.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-05-25 18:00:02 -04:00
Al Viro
059338aae3 move mount_capable() calls to vfs_get_tree()
sget_fc() is called only from ->get_tree() instances and
the only instance not calling it is legacy_get_tree(),
which calls mount_capable() directly.

In all sget_fc() callers the checks could be moved to the
very beginning of ->get_tree() - ->user_ns is not changed
in between.  So lifting the checks to the only caller of
->get_tree() is OK.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-05-25 18:00:01 -04:00
Al Viro
20284ab742 switch mount_capable() to fs_context
now both callers of mount_capable() have access to fs_context;
the only difference is that for sget_fc() we have the possibility
of fc->global being true, while for legacy_get_tree() it's guaranteed
to be impossible.  Unify to more generic variant...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-05-25 17:59:59 -04:00
Al Viro
2527b284de move the capability checks from sget_userns() to legacy_get_tree()
1) all call chains leading to sget_userns() pass through ->mount()
instances.
2) none of ->mount() instances is ever called directly - the only
call site is legacy_get_tree()
3) all remaining ->mount() instances end up calling sget_userns()

IOW, we might as well do the capability checks just before calling
->mount().  As for the arguments passed to mount_capable(),
in case of call chains to sget_userns() going through sget(),
we either don't call mount_capable() at all, or pass current_user_ns()
to it.  The call chains going through mount_pseudo_xattr() don't
call mount_capable() at all (SB_KERNMOUNT in flags on those).

That could've been split into smaller steps (lifting the checks
into sget(), then callers of sget(), then all the way to the
entries of every ->mount() out there, then to the sole caller),
but that would be too much churn for little benefit...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-05-25 17:59:58 -04:00
David Howells
bb7b6b2bbd vfs: Kill mount_ns()
Kill mount_ns() as it has been replaced by vfs_get_super() in the new mount
API.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-05-25 17:59:57 -04:00
Al Viro
0ce0cf12fc consolidate the capability checks in sget_{fc,userns}()
... into a common helper - mount_capable(type, userns)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-05-25 17:59:56 -04:00
Al Viro
feb8ae43a7 start massaging the checks in sget_...(): move to sget_userns()
there are 3 remaining callers of sget_userns() - sget(), mount_ns()
and mount_pseudo_xattr().  Extra check in sget() is conditional
upon mount being neither KERNMOUNT nor SUBMOUNT, the identical one
in mount_ns() - upon being not KERNMOUNT; mount_pseudo_xattr()
has no such checks at all.

However, mount_ns() is never used with SUBMOUNT and mount_pseudo_xattr()
is used only for KERNMOUNT, so both would be fine with the same logics
as currently done in sget(), allowing to consolidate the entire thing
in sget_userns() itself.

That's not where these checks will end up in the long run, though -
the whole reason why they'd been done so deep in the bowels of
mount(2) was that there had been no way for a filesystem to specify
which userns to look at until it has entered ->mount().

Now there is a place where filesystem could override the defaults -
->init_fs_context().  Which allows to pull the checks out into
the callers of vfs_get_tree().  That'll take quite a bit of
massage, but that mess is possible to tease apart.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-05-25 17:59:55 -04:00
Al Viro
ee948837d7 [fix] get rid of checking for absent device name in vfs_get_tree()
It has no business being there, it's checked by relevant ->get_tree()
as it is *and* it returns the wrong error for no reason whatsoever.

Fixes: f3a09c9201 "introduce fs_context methods"
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-04-28 21:34:21 -04:00
David Howells
06a2ae56b5 vfs: Add some logging to the core users of the fs_context log
Add some logging to the core users of the fs_context log so that
information can be extracted from them as to the reason for failure.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-02-28 03:29:38 -05:00
Al Viro
cb50b348c7 convenience helpers: vfs_get_super() and sget_fc()
the former is an analogue of mount_{single,nodev} for use in
->get_tree() instances, the latter - analogue of sget() for the
same.

These are fairly similar to the originals, but the callback signature
for sget_fc() is different from sget() ones, so getting bits and
pieces shared would be too convoluted; we might get around to that
later, but for now let's just remember to keep them in sync.  They
do live next to each other, and changes in either won't be hard
to spot.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-02-28 03:29:26 -05:00
Al Viro
f3a09c9201 introduce fs_context methods
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-01-30 17:44:27 -05:00
David Howells
8d0347f6c3 convert do_remount_sb() to fs_context
Replace do_remount_sb() with a function, reconfigure_super(), that's
fs_context aware.  The fs_context is expected to be parameterised already
and have ->root pointing to the superblock to be reconfigured.

A legacy wrapper is provided that is intended to be called from the
fs_context ops when those appear, but for now is called directly from
reconfigure_super().  This wrapper invokes the ->remount_fs() superblock op
for the moment.  It is intended that the remount_fs() op will be phased
out.

The fs_context->purpose is set to FS_CONTEXT_FOR_RECONFIGURE to indicate
that the context is being used for reconfiguration.

do_umount_root() is provided to consolidate remount-to-R/O for umount and
emergency remount by creating a context and invoking reconfiguration.

do_remount(), do_umount() and do_emergency_remount_callback() are switched
to use the new process.

[AV -- fold UMOUNT and EMERGENCY_REMOUNT in; fixes the
umount / bug, gets rid of pointless complexity]
[AV -- set ->net_ns in all cases; nfs remount will need that]
[AV -- shift security_sb_remount() call into reconfigure_super(); the callers
that didn't do security_sb_remount() have NULL fc->security anyway, so it's
a no-op for them]

Signed-off-by: David Howells <dhowells@redhat.com>
Co-developed-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-01-30 17:44:26 -05:00
Al Viro
c9ce29ed79 vfs_get_tree(): evict the call of security_sb_kern_mount()
Right now vfs_get_tree() calls security_sb_kern_mount() (i.e.
mount MAC) unless it gets MS_KERNMOUNT or MS_SUBMOUNT in flags.
Doing it that way is both clumsy and imprecise.

Consider the callers' tree of vfs_get_tree():
vfs_get_tree()
        <- do_new_mount()
	<- vfs_kern_mount()
		<- simple_pin_fs()
		<- vfs_submount()
		<- kern_mount_data()
		<- init_mount_tree()
		<- btrfs_mount()
			<- vfs_get_tree()
		<- nfs_do_root_mount()
			<- nfs4_try_mount()
				<- nfs_fs_mount()
					<- vfs_get_tree()
			<- nfs4_referral_mount()

do_new_mount() always does need MAC (we are guaranteed that neither
MS_KERNMOUNT nor MS_SUBMOUNT will be passed there).

simple_pin_fs(), vfs_submount() and kern_mount_data() pass explicit
flags inhibiting that check.  So does nfs4_referral_mount() (the
flags there are ulimately coming from vfs_submount()).

init_mount_tree() is called too early for anything LSM-related; it
doesn't matter whether we attempt those checks, they'll do nothing.

Finally, in case of btrfs_mount() and nfs_fs_mount(), doing MAC
is pointless - either the caller will do it, or the flags are
such that we wouldn't have done it either.

In other words, the one and only case when we want that check
done is when we are called from do_new_mount(), and there we
want it unconditionally.

So let's simply move it there.  The superblock is still locked,
so nobody is going to get access to it (via ustat(2), etc.)
until we get a chance to apply the checks - we are free to
move them to any point up to where we drop ->s_umount (in
do_new_mount_fc()).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-01-30 17:44:26 -05:00
Al Viro
a0c9a8b8fd teach vfs_get_tree() to handle subtype, switch do_new_mount() to it
Roll the handling of subtypes into do_new_mount() and vfs_get_tree().  The
former determines any subtype string and hangs it off the fs_context; the
latter applies it.

Make do_new_mount() create, parameterise and commit an fs_context and
create a mount for itself rather than calling vfs_kern_mount().

[AV -- missing kstrdup()]
[AV -- ... and no kstrdup() if we get to setting ->s_submount - we
simply transfer it from fc, leaving NULL behind]
[AV -- constify ->s_submount, while we are at it]

Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-01-30 17:44:25 -05:00
David Howells
9bc61ab18b vfs: Introduce fs_context, switch vfs_kern_mount() to it.
Introduce a filesystem context concept to be used during superblock
creation for mount and superblock reconfiguration for remount.  This is
allocated at the beginning of the mount procedure and into it is placed:

 (1) Filesystem type.

 (2) Namespaces.

 (3) Source/Device names (there may be multiple).

 (4) Superblock flags (SB_*).

 (5) Security details.

 (6) Filesystem-specific data, as set by the mount options.

Accessor functions are then provided to set up a context, parameterise it
from monolithic mount data (the data page passed to mount(2)) and tear it
down again.

A legacy wrapper is provided that implements what will be the basic
operations, wrapping access to filesystems that aren't yet aware of the
fs_context.

Finally, vfs_kern_mount() is changed to make use of the fs_context and
mount_fs() is replaced by vfs_get_tree(), called from vfs_kern_mount().
[AV -- add missing kstrdup()]
[AV -- put_cred() can be unconditional - fc->cred can't be NULL]
[AV -- take legacy_validate() contents into legacy_parse_monolithic()]
[AV -- merge KERNEL_MOUNT and USER_MOUNT]
[AV -- don't unlock superblock on success return from vfs_get_tree()]
[AV -- kill 'reference' argument of init_fs_context()]

Signed-off-by: David Howells <dhowells@redhat.com>
Co-developed-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-01-30 17:44:23 -05:00
Al Viro
718c43038f mount_fs: suppress MAC on MS_SUBMOUNT as well as MS_KERNMOUNT
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-12-21 11:51:23 -05:00
Al Viro
204cc0ccf1 LSM: hide struct security_mnt_opts from any generic code
Keep void * instead, allocate on demand (in parse_str_opts, at the
moment).  Eventually both selinux and smack will be better off
with private structures with several strings in those, rather than
this "counter and two pointers to dynamically allocated arrays"
ugliness.  This commit allows to do that at leisure, without
disrupting anything outside of given module.

Changes:
	* instead of struct security_mnt_opt use an opaque pointer
initialized to NULL.
	* security_sb_eat_lsm_opts(), security_sb_parse_opts_str() and
security_free_mnt_opts() take it as var argument (i.e. as void **);
call sites are unchanged.
	* security_sb_set_mnt_opts() and security_sb_remount() take
it by value (i.e. as void *).
	* new method: ->sb_free_mnt_opts().  Takes void *, does
whatever freeing that needs to be done.
	* ->sb_set_mnt_opts() and ->sb_remount() might get NULL as
mnt_opts argument, meaning "empty".

Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-12-21 11:48:34 -05:00
Al Viro
a10d7c22b3 LSM: split ->sb_set_mnt_opts() out of ->sb_kern_mount()
... leaving the "is it kernel-internal" logics in the caller.

Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-12-21 11:46:42 -05:00
Al Viro
f5c0c26d90 new helper: security_sb_eat_lsm_opts()
combination of alloc_secdata(), security_sb_copy_data(),
security_sb_parse_opt_str() and free_secdata().

Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-12-21 11:46:00 -05:00
Al Viro
6be8750b4c LSM: lift parsing LSM options into the caller of ->sb_kern_mount()
This paves the way for retaining the LSM options from a common filesystem
mount context during a mount parameter parsing phase to be instituted prior
to actual mount/reconfiguration actions.

Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-12-21 11:45:30 -05:00
David Howells
e262e32d6b vfs: Suppress MS_* flag defs within the kernel unless explicitly enabled
Only the mount namespace code that implements mount(2) should be using the
MS_* flags.  Suppress them inside the kernel unless uapi/linux/mount.h is
included.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: David Howells <dhowells@redhat.com>
2018-12-20 16:32:56 +00:00
Amir Goldstein
1e6cb72399 fsnotify: add super block object type
Add the infrastructure to attach a mark to a super_block struct
and detach all attached marks when super block is destroyed.

This is going to be used by fanotify backend to setup super block
marks.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-09-03 15:14:01 +02:00
Linus Torvalds
aba16dc5cf Merge branch 'ida-4.19' of git://git.infradead.org/users/willy/linux-dax
Pull IDA updates from Matthew Wilcox:
 "A better IDA API:

      id = ida_alloc(ida, GFP_xxx);
      ida_free(ida, id);

  rather than the cumbersome ida_simple_get(), ida_simple_remove().

  The new IDA API is similar to ida_simple_get() but better named.  The
  internal restructuring of the IDA code removes the bitmap
  preallocation nonsense.

  I hope the net -200 lines of code is convincing"

* 'ida-4.19' of git://git.infradead.org/users/willy/linux-dax: (29 commits)
  ida: Change ida_get_new_above to return the id
  ida: Remove old API
  test_ida: check_ida_destroy and check_ida_alloc
  test_ida: Convert check_ida_conv to new API
  test_ida: Move ida_check_max
  test_ida: Move ida_check_leaf
  idr-test: Convert ida_check_nomem to new API
  ida: Start new test_ida module
  target/iscsi: Allocate session IDs from an IDA
  iscsi target: fix session creation failure handling
  drm/vmwgfx: Convert to new IDA API
  dmaengine: Convert to new IDA API
  ppc: Convert vas ID allocation to new IDA API
  media: Convert entity ID allocation to new IDA API
  ppc: Convert mmu context allocation to new IDA API
  Convert net_namespace to new IDA API
  cb710: Convert to new IDA API
  rsxx: Convert to new IDA API
  osd: Convert to new IDA API
  sd: Convert to new IDA API
  ...
2018-08-26 11:48:42 -07:00
Matthew Wilcox
5a66847e44 fs: Convert unnamed_dev_ida to new API
The new API is much easier for this user.  Also add kerneldoc for
get_anon_bdev().

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-08-21 23:54:16 -04:00
Kirill Tkhai
9b996468cf mm: add SHRINK_EMPTY shrinker methods return value
We need to distinguish the situations when shrinker has very small
amount of objects (see vfs_pressure_ratio() called from
super_cache_count()), and when it has no objects at all.  Currently, in
the both of these cases, shrinker::count_objects() returns 0.

The patch introduces new SHRINK_EMPTY return value, which will be used
for "no objects at all" case.  It's is a refactoring mostly, as
SHRINK_EMPTY is replaced by 0 by all callers of do_shrink_slab() in this
patch, and all the magic will happen in further.

Link: http://lkml.kernel.org/r/153063069574.1818.11037751256699341813.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:31 -07:00
Kirill Tkhai
c92e8e10ca fs: propagate shrinker::id to list_lru
Add list_lru::shrinker_id field and populate it by registered shrinker
id.

This will be used to set correct bit in memcg shrinkers map by lru code
in next patches, after there appeared the first related to memcg element
in list_lru.

Link: http://lkml.kernel.org/r/153063059758.1818.14866596416857717800.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:31 -07:00
Kirill Tkhai
2b3648a6ff fs/super.c: refactor alloc_super()
Do two list_lru_init_memcg() calls after prealloc_super().
destroy_unused_super() in fail path is OK with this.  Next patch needs
such the order.

Link: http://lkml.kernel.org/r/153063058712.1818.3382490999719078571.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:31 -07:00
Linus Torvalds
f956d08a56 Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
 "Misc bits and pieces not fitting into anything more specific"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  vfs: delete unnecessary assignment in vfs_listxattr
  Documentation: filesystems: update filesystem locking documentation
  vfs: namei: use path_equal() in follow_dotdot()
  fs.h: fix outdated comment about file flags
  __inode_security_revalidate() never gets NULL opt_dentry
  make xattr_getsecurity() static
  vfat: simplify checks in vfat_lookup()
  get rid of dead code in d_find_alias()
  it's SB_BORN, not MS_BORN...
  msdos_rmdir(): kill BS comment
  remove rpc_rmdir()
  fs: avoid fdput() after failed fdget() in vfs_dedupe_file_range()
2018-06-04 10:14:28 -07:00
Dave Chinner
79f546a696 fs: don't scan the inode cache before SB_BORN is set
We recently had an oops reported on a 4.14 kernel in
xfs_reclaim_inodes_count() where sb->s_fs_info pointed to garbage
and so the m_perag_tree lookup walked into lala land.  It produces
an oops down this path during the failed mount:

  radix_tree_gang_lookup_tag+0xc4/0x130
  xfs_perag_get_tag+0x37/0xf0
  xfs_reclaim_inodes_count+0x32/0x40
  xfs_fs_nr_cached_objects+0x11/0x20
  super_cache_count+0x35/0xc0
  shrink_slab.part.66+0xb1/0x370
  shrink_node+0x7e/0x1a0
  try_to_free_pages+0x199/0x470
  __alloc_pages_slowpath+0x3a1/0xd20
  __alloc_pages_nodemask+0x1c3/0x200
  cache_grow_begin+0x20b/0x2e0
  fallback_alloc+0x160/0x200
  kmem_cache_alloc+0x111/0x4e0

The problem is that the superblock shrinker is running before the
filesystem structures it depends on have been fully set up. i.e.
the shrinker is registered in sget(), before ->fill_super() has been
called, and the shrinker can call into the filesystem before
fill_super() does it's setup work. Essentially we are exposed to
both use-after-free and use-before-initialisation bugs here.

To fix this, add a check for the SB_BORN flag in super_cache_count.
In general, this flag is not set until ->fs_mount() completes
successfully, so we know that it is set after the filesystem
setup has completed. This matches the trylock_super() behaviour
which will not let super_cache_scan() run if SB_BORN is not set, and
hence will not allow the superblock shrinker from entering the
filesystem while it is being set up or after it has failed setup
and is being torn down.

Cc: stable@kernel.org
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-05-11 15:37:57 -04:00
Al Viro
1c18d2a15e it's SB_BORN, not MS_BORN...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-05-10 15:09:41 -04:00
Tetsuo Handa
8e04944f0e mm,vmscan: Allow preallocating memory for register_shrinker().
syzbot is catching so many bugs triggered by commit 9ee332d99e
("sget(): handle failures of register_shrinker()"). That commit expected
that calling kill_sb() from deactivate_locked_super() without successful
fill_super() is safe, but the reality was different; some callers assign
attributes which are needed for kill_sb() after sget() succeeds.

For example, [1] is a report where sb->s_mode (which seems to be either
FMODE_READ | FMODE_EXCL | FMODE_WRITE or FMODE_READ | FMODE_EXCL) is not
assigned unless sget() succeeds. But it does not worth complicate sget()
so that register_shrinker() failure path can safely call
kill_block_super() via kill_sb(). Making alloc_super() fail if memory
allocation for register_shrinker() failed is much simpler. Let's avoid
calling deactivate_locked_super() from sget_userns() by preallocating
memory for the shrinker and making register_shrinker() in sget_userns()
never fail.

[1] https://syzkaller.appspot.com/bug?id=588996a25a2587be2e3a54e8646728fb9cae44e7

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: syzbot <syzbot+5a170e19c963a2e0df79@syzkaller.appspotmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-04-16 02:06:47 -04:00
Linus Torvalds
7214dd4ea9 Merge branch 'work.thaw' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs thaw updates from Al Viro:
 "An ancient series that has fallen through the cracks in the previous
  cycle"

* 'work.thaw' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  buffer.c: call thaw_super during emergency thaw
  vfs: factor sb iteration out of do_emergency_remount
2018-04-12 12:28:32 -07:00
Mateusz Guzik
08fdc8a013 buffer.c: call thaw_super during emergency thaw
There are 2 distinct freezing mechanisms - one operates on block
devices and another one directly on super blocks. Both end up with the
same result, but thaw of only one of these does not thaw the other.

In particular fsfreeze --freeze uses the ioctl variant going to the
super block. Since prior to this patch emergency thaw was not doing
a relevant thaw, filesystems frozen with this method remained
unaffected.

The patch is a hack which adds blind unfreezing.

In order to keep the super block write-locked the whole time the code
is shuffled around and the newly introduced __iterate_supers is
employed.

Signed-off-by: Mateusz Guzik <mguzik@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-03-19 02:21:40 -04:00
Linus Torvalds
19e7b5f994 Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
 "All kinds of misc stuff, without any unifying topic, from various
  people.

  Neil's d_anon patch, several bugfixes, introduction of kvmalloc
  analogue of kmemdup_user(), extending bitfield.h to deal with
  fixed-endians, assorted cleanups all over the place..."

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (28 commits)
  alpha: osf_sys.c: use timespec64 where appropriate
  alpha: osf_sys.c: fix put_tv32 regression
  jffs2: Fix use-after-free bug in jffs2_iget()'s error handling path
  dcache: delete unused d_hash_mask
  dcache: subtract d_hash_shift from 32 in advance
  fs/buffer.c: fold init_buffer() into init_page_buffers()
  fs: fold __inode_permission() into inode_permission()
  fs: add RWF_APPEND
  sctp: use vmemdup_user() rather than badly open-coding memdup_user()
  snd_ctl_elem_init_enum_names(): switch to vmemdup_user()
  replace_user_tlv(): switch to vmemdup_user()
  new primitive: vmemdup_user()
  memdup_user(): switch to GFP_USER
  eventfd: fold eventfd_ctx_get() into eventfd_ctx_fileget()
  eventfd: fold eventfd_ctx_read() into eventfd_read()
  eventfd: convert to use anon_inode_getfd()
  nfs4file: get rid of pointless include of btrfs.h
  uvc_v4l2: clean copyin/copyout up
  vme_user: don't use __copy_..._user()
  usx2y: don't bother with memdup_user() for 16-byte structure
  ...
2018-01-31 09:25:20 -08:00
NeilBrown
f1ee616214 VFS: don't keep disconnected dentries on d_anon
The original purpose of the per-superblock d_anon list was to
keep disconnected dentries in the cache between consecutive
requests to the NFS server.  Dentries can be disconnected if
a client holds a file open and repeatedly performs IO on it,
and if the server drops the dentry, whether due to memory
pressure, server restart, or "echo 3 > /proc/sys/vm/drop_caches".

This purpose was thwarted by commit 75a6f82a0d ("freeing unlinked
file indefinitely delayed") which caused disconnected dentries
to be freed as soon as their refcount reached zero.

This means that, when a dentry being used by nfsd gets disconnected, a
new one needs to be allocated for every request (unless requests
overlap).  As the dentry has no name, no parent, and no children,
there is little of value to cache.  As small memory allocations are
typically fast (from per-cpu free lists) this likely has little cost.

This means that the original purpose of s_anon is no longer relevant:
there is no longer any need to keep disconnected dentries on a list so
they appear to be hashed.

However, s_anon now has a new use.  When you mount an NFS filesystem,
the dentry stored in s_root is just a placebo.  The "real" root dentry
is allocated using d_obtain_root() and so it kept on the s_anon list.
I don't know the reason for this, but suspect it related to NFSv4
where a mount of "server:/some/path" require NFS to look up the root
filehandle on the server, then walk down "/some" and "/path" to get
the filehandle to mount.

Whatever the reason, NFS depends on the s_anon list and on
shrink_dcache_for_umount() pruning all dentries on this list.  So we
cannot simply remove s_anon.

We could just leave the code unchanged, but apart from that being
potentially confusing, the (unfair) bit-spin-lock which protects
s_anon can become a bottle neck when lots of disconnected dentries are
being created.

So this patch renames s_anon to s_roots, and stops storing
disconnected dentries on the list.  Only dentries obtained with
d_obtain_root() are now stored on this list.  There are many fewer of
these (only NFS and NILFS2 use the call, and only during filesystem
mount) so contention on the bit-lock will not be a problem.

Possibly an alternate solution should be found for NFS and NILFS2, but
that would require understanding their needs first.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-12-25 20:22:07 -05:00
Al Viro
9ee332d99e sget(): handle failures of register_shrinker()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-12-18 15:05:07 -05:00
Al Viro
ca0168e8a7 alloc_super(): do ->s_umount initialization earlier
... so that failure exits could count on it having been
done.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-12-05 09:32:25 -05:00
Linus Torvalds
ca5b857cb0 Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
 "Assorted stuff, really no common topic here"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  vfs: grab the lock instead of blocking in __fd_install during resizing
  vfs: stop clearing close on exec when closing a fd
  include/linux/fs.h: fix comment about struct address_space
  fs: make fiemap work from compat_ioctl
  coda: fix 'kernel memory exposure attempt' in fsync
  pstore: remove unneeded unlikely()
  vfs: remove unneeded unlikely()
  stubs for mount_bdev() and kill_block_super() in !CONFIG_BLOCK case
  make vfs_ustat() static
  do_handle_open() should be static
  elf_fdpic: fix unused variable warning
  fold destroy_super() into __put_super()
  new helper: destroy_unused_super()
  fix address space warnings in ipc/
  acct.h: get rid of detritus
2017-11-17 12:54:01 -08:00
Greg Kroah-Hartman
b24413180f License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier.  The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
 - file had no licensing information it it.
 - file was a */uapi/* one with no licensing information in it,
 - file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne.  Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed.  Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
 - Files considered eligible had to be source code files.
 - Make and config files were included as candidates if they contained >5
   lines of source
 - File already had some variant of a license header in it (even if <5
   lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

 - when both scanners couldn't find any license traces, file was
   considered to have no license information in it, and the top level
   COPYING file license applied.

   For non */uapi/* files that summary was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0                                              11139

   and resulted in the first patch in this series.

   If that file was a */uapi/* path one, it was "GPL-2.0 WITH
   Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0 WITH Linux-syscall-note                        930

   and resulted in the second patch in this series.

 - if a file had some form of licensing information in it, and was one
   of the */uapi/* ones, it was denoted with the Linux-syscall-note if
   any GPL family license was found in the file or had no licensing in
   it (per prior point).  Results summary:

   SPDX license identifier                            # files
   ---------------------------------------------------|------
   GPL-2.0 WITH Linux-syscall-note                       270
   GPL-2.0+ WITH Linux-syscall-note                      169
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
   LGPL-2.1+ WITH Linux-syscall-note                      15
   GPL-1.0+ WITH Linux-syscall-note                       14
   ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
   LGPL-2.0+ WITH Linux-syscall-note                       4
   LGPL-2.1 WITH Linux-syscall-note                        3
   ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
   ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1

   and that resulted in the third patch in this series.

 - when the two scanners agreed on the detected license(s), that became
   the concluded license(s).

 - when there was disagreement between the two scanners (one detected a
   license but the other didn't, or they both detected different
   licenses) a manual inspection of the file occurred.

 - In most cases a manual inspection of the information in the file
   resulted in a clear resolution of the license that should apply (and
   which scanner probably needed to revisit its heuristics).

 - When it was not immediately clear, the license identifier was
   confirmed with lawyers working with the Linux Foundation.

 - If there was any question as to the appropriate license identifier,
   the file was flagged for further research and to be revisited later
   in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights.  The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
 - a full scancode scan run, collecting the matched texts, detected
   license ids and scores
 - reviewing anything where there was a license detected (about 500+
   files) to ensure that the applied SPDX license was correct
 - reviewing anything where there was no detection but the patch license
   was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
   SPDX license was correct

This produced a worksheet with 20 files needing minor correction.  This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg.  Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected.  This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.)  Finally Greg ran the script using the .csv files to
generate the patches.

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-02 11:10:55 +01:00
Al Viro
c645b93098 fold destroy_super() into __put_super()
There's only one caller of destroy_super() left now.  Fold it there,
and replace those list_lru_destroy() calls with checks that they
had already been done (as they should have, when we were dropping
the last active reference).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-10-11 13:56:37 -04:00
Al Viro
0200894d11 new helper: destroy_unused_super()
Used for disposal of super_block instances that had never been reachable
via any shared data structures.  No need for RCU delay in there, everything
can be called directly.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-10-11 13:55:42 -04:00
Mateusz Guzik
fa7c1d5080 vfs: factor sb iteration out of do_emergency_remount
The intent is to reduce code duplication with other code
iterating the list.

No functional changes.

Signed-off-by: Mateusz Guzik <mguzik@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-10-05 11:22:06 -04:00
Linus Torvalds
0f0d12728e Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull mount flag updates from Al Viro:
 "Another chunk of fmount preparations from dhowells; only trivial
  conflicts for that part. It separates MS_... bits (very grotty
  mount(2) ABI) from the struct super_block ->s_flags (kernel-internal,
  only a small subset of MS_... stuff).

  This does *not* convert the filesystems to new constants; only the
  infrastructure is done here. The next step in that series is where the
  conflicts would be; that's the conversion of filesystems. It's purely
  mechanical and it's better done after the merge, so if you could run
  something like

	list=$(for i in MS_RDONLY MS_NOSUID MS_NODEV MS_NOEXEC MS_SYNCHRONOUS MS_MANDLOCK MS_DIRSYNC MS_NOATIME MS_NODIRATIME MS_SILENT MS_POSIXACL MS_KERNMOUNT MS_I_VERSION MS_LAZYTIME; do git grep -l $i fs drivers/staging/lustre drivers/mtd ipc mm include/linux; done|sort|uniq|grep -v '^fs/namespace.c$')

	sed -i -e 's/\<MS_RDONLY\>/SB_RDONLY/g' \
	        -e 's/\<MS_NOSUID\>/SB_NOSUID/g' \
	        -e 's/\<MS_NODEV\>/SB_NODEV/g' \
	        -e 's/\<MS_NOEXEC\>/SB_NOEXEC/g' \
	        -e 's/\<MS_SYNCHRONOUS\>/SB_SYNCHRONOUS/g' \
	        -e 's/\<MS_MANDLOCK\>/SB_MANDLOCK/g' \
	        -e 's/\<MS_DIRSYNC\>/SB_DIRSYNC/g' \
	        -e 's/\<MS_NOATIME\>/SB_NOATIME/g' \
	        -e 's/\<MS_NODIRATIME\>/SB_NODIRATIME/g' \
	        -e 's/\<MS_SILENT\>/SB_SILENT/g' \
	        -e 's/\<MS_POSIXACL\>/SB_POSIXACL/g' \
	        -e 's/\<MS_KERNMOUNT\>/SB_KERNMOUNT/g' \
	        -e 's/\<MS_I_VERSION\>/SB_I_VERSION/g' \
	        -e 's/\<MS_LAZYTIME\>/SB_LAZYTIME/g' \
	        $list

  and commit it with something along the lines of 'convert filesystems
  away from use of MS_... constants' as commit message, it would save a
  quite a bit of headache next cycle"

* 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  VFS: Differentiate mount flags (MS_*) from internal superblock flags
  VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb)
  vfs: Add sb_rdonly(sb) to query the MS_RDONLY flag on s_flags
2017-09-14 18:54:01 -07:00
Jan Kara
bc8230ee8e quota: Convert dqio_mutex to rwsem
Convert dqio_mutex to rwsem and call it dqio_sem. No functional changes
yet.

Signed-off-by: Jan Kara <jack@suse.cz>
2017-08-17 18:52:48 +02:00
David Howells
e462ec50cb VFS: Differentiate mount flags (MS_*) from internal superblock flags
Differentiate the MS_* flags passed to mount(2) from the internal flags set
in the super_block's s_flags.  s_flags are now called SB_*, with the names
and the values for the moment mirroring the MS_* flags that they're
equivalent to.

In this patch, just the headers are altered and some kernel code where
blind automated conversion isn't necessarily correct.

Note that this shows up some interesting issues:

 (1) Some MS_* flags get translated to MNT_* flags (such as MS_NODEV ->
     MNT_NODEV) without passing this on to the filesystem, but some
     filesystems set such flags anyway.

 (2) The ->remount_fs() methods of some filesystems adjust the *flags
     argument by setting MS_* flags in it, such as MS_NOATIME - but these
     flags are then scrubbed by do_remount_sb() (only the occupants of
     MS_RMT_MASK are permitted: MS_RDONLY, MS_SYNCHRONOUS, MS_MANDLOCK,
     MS_I_VERSION and MS_LAZYTIME)

I'm not sure what's the best way to solve all these cases.

Suggested-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: David Howells <dhowells@redhat.com>
2017-07-17 08:45:35 +01:00
David Howells
bc98a42c1f VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb)
Firstly by applying the following with coccinelle's spatch:

	@@ expression SB; @@
	-SB->s_flags & MS_RDONLY
	+sb_rdonly(SB)

to effect the conversion to sb_rdonly(sb), then by applying:

	@@ expression A, SB; @@
	(
	-(!sb_rdonly(SB)) && A
	+!sb_rdonly(SB) && A
	|
	-A != (sb_rdonly(SB))
	+A != sb_rdonly(SB)
	|
	-A == (sb_rdonly(SB))
	+A == sb_rdonly(SB)
	|
	-!(sb_rdonly(SB))
	+!sb_rdonly(SB)
	|
	-A && (sb_rdonly(SB))
	+A && sb_rdonly(SB)
	|
	-A || (sb_rdonly(SB))
	+A || sb_rdonly(SB)
	|
	-(sb_rdonly(SB)) != A
	+sb_rdonly(SB) != A
	|
	-(sb_rdonly(SB)) == A
	+sb_rdonly(SB) == A
	|
	-(sb_rdonly(SB)) && A
	+sb_rdonly(SB) && A
	|
	-(sb_rdonly(SB)) || A
	+sb_rdonly(SB) || A
	)

	@@ expression A, B, SB; @@
	(
	-(sb_rdonly(SB)) ? 1 : 0
	+sb_rdonly(SB)
	|
	-(sb_rdonly(SB)) ? A : B
	+sb_rdonly(SB) ? A : B
	)

to remove left over excess bracketage and finally by applying:

	@@ expression A, SB; @@
	(
	-(A & MS_RDONLY) != sb_rdonly(SB)
	+(bool)(A & MS_RDONLY) != sb_rdonly(SB)
	|
	-(A & MS_RDONLY) == sb_rdonly(SB)
	+(bool)(A & MS_RDONLY) == sb_rdonly(SB)
	)

to make comparisons against the result of sb_rdonly() (which is a bool)
work correctly.

Signed-off-by: David Howells <dhowells@redhat.com>
2017-07-17 08:45:34 +01:00
David Howells
1d278a8790 VFS: Kill off s_options and helpers
Kill off s_options, save/replace_mount_options() and generic_show_options()
as all filesystems now implement ->show_options() for themselves.  This
should make it easier to implement a context-based mount where the mount
options can be passed individually over a file descriptor.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-07-11 06:09:21 -04:00
David Howells
dd111b31e9 VFS: Clean up whitespace in fs/namespace.c and fs/super.c
Clean up line terminal whitespace in fs/namespace.c and fs/super.c.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-07-06 03:27:09 -04:00
Jan Kara
7c4cc30024 bdi: Drop 'parent' argument from bdi_register[_va]()
Drop 'parent' argument of bdi_register() and bdi_register_va().  It is
always NULL.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-20 12:09:55 -06:00
Jan Kara
c1844d536d fs: Remove SB_I_DYNBDI flag
Now that all bdi structures filesystems use are properly refcounted, we
can remove the SB_I_DYNBDI flag.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-20 12:09:55 -06:00
Jan Kara
13eec2363e fs: Get proper reference for s_bdi
So far we just relied on block device to hold a bdi reference for us
while the filesystem is mounted. While that works perfectly fine, it is
a bit awkward that we have a pointer to a refcounted structure in the
superblock without proper reference. So make s_bdi hold a proper
reference to block device's BDI. No filesystem using mount_bdev()
actually changes s_bdi so this is safe and will make bdev filesystems
work the same way as filesystems needing to set up their private bdi.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-20 12:09:55 -06:00
Jan Kara
fca39346a5 fs: Provide infrastructure for dynamic BDIs in filesystems
Provide helper functions for setting up dynamically allocated
backing_dev_info structures for filesystems and cleaning them up on
superblock destruction.

CC: linux-mtd@lists.infradead.org
CC: linux-nfs@vger.kernel.org
CC: Petr Vandrovec <petr@vandrovec.name>
CC: linux-nilfs@vger.kernel.org
CC: cluster-devel@redhat.com
CC: osd-dev@open-osd.org
CC: codalist@coda.cs.cmu.edu
CC: linux-afs@lists.infradead.org
CC: ecryptfs@vger.kernel.org
CC: linux-cifs@vger.kernel.org
CC: ceph-devel@vger.kernel.org
CC: linux-btrfs@vger.kernel.org
CC: v9fs-developer@lists.sourceforge.net
CC: lustre-devel@lists.lustre.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-20 12:09:55 -06:00
Linus Torvalds
f1ef09fde1 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull namespace updates from Eric Biederman:
 "There is a lot here. A lot of these changes result in subtle user
  visible differences in kernel behavior. I don't expect anything will
  care but I will revert/fix things immediately if any regressions show
  up.

  From Seth Forshee there is a continuation of the work to make the vfs
  ready for unpriviled mounts. We had thought the previous changes
  prevented the creation of files outside of s_user_ns of a filesystem,
  but it turns we missed the O_CREAT path. Ooops.

  Pavel Tikhomirov and Oleg Nesterov worked together to fix a long
  standing bug in the implemenation of PR_SET_CHILD_SUBREAPER where only
  children that are forked after the prctl are considered and not
  children forked before the prctl. The only known user of this prctl
  systemd forks all children after the prctl. So no userspace
  regressions will occur. Holding earlier forked children to the same
  rules as later forked children creates a semantic that is sane enough
  to allow checkpoing of processes that use this feature.

  There is a long delayed change by Nikolay Borisov to limit inotify
  instances inside a user namespace.

  Michael Kerrisk extends the API for files used to maniuplate
  namespaces with two new trivial ioctls to allow discovery of the
  hierachy and properties of namespaces.

  Konstantin Khlebnikov with the help of Al Viro adds code that when a
  network namespace exits purges it's sysctl entries from the dcache. As
  in some circumstances this could use a lot of memory.

  Vivek Goyal fixed a bug with stacked filesystems where the permissions
  on the wrong inode were being checked.

  I continue previous work on ptracing across exec. Allowing a file to
  be setuid across exec while being ptraced if the tracer has enough
  credentials in the user namespace, and if the process has CAP_SETUID
  in it's own namespace. Proc files for setuid or otherwise undumpable
  executables are now owned by the root in the user namespace of their
  mm. Allowing debugging of setuid applications in containers to work
  better.

  A bug I introduced with permission checking and automount is now
  fixed. The big change is to mark the mounts that the kernel initiates
  as a result of an automount. This allows the permission checks in sget
  to be safely suppressed for this kind of mount. As the permission
  check happened when the original filesystem was mounted.

  Finally a special case in the mount namespace is removed preventing
  unbounded chains in the mount hash table, and making the semantics
  simpler which benefits CRIU.

  The vfs fix along with related work in ima and evm I believe makes us
  ready to finish developing and merge fully unprivileged mounts of the
  fuse filesystem. The cleanups of the mount namespace makes discussing
  how to fix the worst case complexity of umount. The stacked filesystem
  fixes pave the way for adding multiple mappings for the filesystem
  uids so that efficient and safer containers can be implemented"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  proc/sysctl: Don't grab i_lock under sysctl_lock.
  vfs: Use upper filesystem inode in bprm_fill_uid()
  proc/sysctl: prune stale dentries during unregistering
  mnt: Tuck mounts under others instead of creating shadow/side mounts.
  prctl: propagate has_child_subreaper flag to every descendant
  introduce the walk_process_tree() helper
  nsfs: Add an ioctl() to return owner UID of a userns
  fs: Better permission checking for submounts
  exit: fix the setns() && PR_SET_CHILD_SUBREAPER interaction
  vfs: open() with O_CREAT should not create inodes with unknown ids
  nsfs: Add an ioctl() to return the namespace type
  proc: Better ownership of files for non-dumpable tasks in user namespaces
  exec: Remove LSM_UNSAFE_PTRACE_CAP
  exec: Test the ptracer's saved cred to see if the tracee can gain caps
  exec: Don't reset euid and egid when the tracee has CAP_SETUID
  inotify: Convert to using per-namespace limits
2017-02-23 20:33:51 -08:00
Jan Kara
dc3b17cc8b block: Use pointer to backing_dev_info from request_queue
We will want to have struct backing_dev_info allocated separately from
struct request_queue. As the first step add pointer to backing_dev_info
to request_queue and convert all users touching it. No functional
changes in this patch.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-02 08:20:48 -07:00
Eric W. Biederman
93faccbbfa fs: Better permission checking for submounts
To support unprivileged users mounting filesystems two permission
checks have to be performed: a test to see if the user allowed to
create a mount in the mount namespace, and a test to see if
the user is allowed to access the specified filesystem.

The automount case is special in that mounting the original filesystem
grants permission to mount the sub-filesystems, to any user who
happens to stumble across the their mountpoint and satisfies the
ordinary filesystem permission checks.

Attempting to handle the automount case by using override_creds
almost works.  It preserves the idea that permission to mount
the original filesystem is permission to mount the sub-filesystem.
Unfortunately using override_creds messes up the filesystems
ordinary permission checks.

Solve this by being explicit that a mount is a submount by introducing
vfs_submount, and using it where appropriate.

vfs_submount uses a new mount internal mount flags MS_SUBMOUNT, to let
sget and friends know that a mount is a submount so they can take appropriate
action.

sget and sget_userns are modified to not perform any permission checks
on submounts.

follow_automount is modified to stop using override_creds as that
has proven problemantic.

do_mount is modified to always remove the new MS_SUBMOUNT flag so
that we know userspace will never by able to specify it.

autofs4 is modified to stop using current_real_cred that was put in
there to handle the previous version of submount permission checking.

cifs is modified to pass the mountpoint all of the way down to vfs_submount.

debugfs is modified to pass the mountpoint all of the way down to
trace_automount by adding a new parameter.  To make this change easier
a new typedef debugfs_automount_t is introduced to capture the type of
the debugfs automount function.

Cc: stable@vger.kernel.org
Fixes: 069d5ac9ae ("autofs:  Fix automounts by using current_real_cred()->uid")
Fixes: aeaa4a79ff ("fs: Call d_automount with the filesystems creds")
Reviewed-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2017-02-02 04:36:12 +13:00
Jan Kara
c3b004460d quota: Remove dqonoff_mutex
The only places that were grabbing dqonoff_mutex are functions turning
quotas on and off and these are properly serialized using s_umount
semaphore. Remove dqonoff_mutex.

Signed-off-by: Jan Kara <jack@suse.cz>
2016-11-30 08:38:07 +01:00
Jan Kara
ba6379f7e6 fs: Provide function to get superblock with exclusive s_umount
Quota code will need a variant of get_super_thawed() that returns
superblock with s_umount held in exclusive mode to serialize quota on
and quota off operations. Provide this functionality.

Signed-off-by: Jan Kara <jack@suse.cz>
2016-11-23 12:53:00 +01:00
Oleg Nesterov
f1a9622037 fs/super.c: don't fool lockdep in freeze_super() and thaw_super() paths
sb_wait_write()->percpu_rwsem_release() fools lockdep to avoid the
false-positives. Now that xfs was fixed by Dave's commit dbad7c9930
("xfs: stop holding ILOCK over filldir callbacks") we can remove it and
change freeze_super() and thaw_super() to run with s_writers.rw_sem locks
held; we add two trivial helpers for that, lockdep_sb_freeze_release()
and lockdep_sb_freeze_acquire().

xfstests-dev/check `grep -il freeze tests/*/???` does not trigger any
warning from lockdep.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-14 20:41:59 -04:00
Oleg Nesterov
89f39af129 fs/super.c: fix race between freeze_super() and thaw_super()
Change thaw_super() to check frozen != SB_FREEZE_COMPLETE rather than
frozen == SB_UNFROZEN, otherwise it can race with freeze_super() which
drops sb->s_umount after SB_FREEZE_WRITE to preserve the lock ordering.

In this case thaw_super() will wrongly call s_op->unfreeze_fs() before
it was actually frozen, and call sb_freeze_unlock() which leads to the
unbalanced percpu_up_write(). Unfortunately lockdep can't detect this,
so this triggers misc BUG_ON()'s in kernel/rcu/sync.c.

Reported-and-tested-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-14 20:00:34 -04:00
Linus Torvalds
a867d7349e Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull userns vfs updates from Eric Biederman:
 "This tree contains some very long awaited work on generalizing the
  user namespace support for mounting filesystems to include filesystems
  with a backing store.  The real world target is fuse but the goal is
  to update the vfs to allow any filesystem to be supported.  This
  patchset is based on a lot of code review and testing to approach that
  goal.

  While looking at what is needed to support the fuse filesystem it
  became clear that there were things like xattrs for security modules
  that needed special treatment.  That the resolution of those concerns
  would not be fuse specific.  That sorting out these general issues
  made most sense at the generic level, where the right people could be
  drawn into the conversation, and the issues could be solved for
  everyone.

  At a high level what this patchset does a couple of simple things:

   - Add a user namespace owner (s_user_ns) to struct super_block.

   - Teach the vfs to handle filesystem uids and gids not mapping into
     to kuids and kgids and being reported as INVALID_UID and
     INVALID_GID in vfs data structures.

  By assigning a user namespace owner filesystems that are mounted with
  only user namespace privilege can be detected.  This allows security
  modules and the like to know which mounts may not be trusted.  This
  also allows the set of uids and gids that are communicated to the
  filesystem to be capped at the set of kuids and kgids that are in the
  owning user namespace of the filesystem.

  One of the crazier corner casees this handles is the case of inodes
  whose i_uid or i_gid are not mapped into the vfs.  Most of the code
  simply doesn't care but it is easy to confuse the inode writeback path
  so no operation that could cause an inode write-back is permitted for
  such inodes (aka only reads are allowed).

  This set of changes starts out by cleaning up the code paths involved
  in user namespace permirted mounts.  Then when things are clean enough
  adds code that cleanly sets s_user_ns.  Then additional restrictions
  are added that are possible now that the filesystem superblock
  contains owner information.

  These changes should not affect anyone in practice, but there are some
  parts of these restrictions that are changes in behavior.

   - Andy's restriction on suid executables that does not honor the
     suid bit when the path is from another mount namespace (think
     /proc/[pid]/fd/) or when the filesystem was mounted by a less
     privileged user.

   - The replacement of the user namespace implicit setting of MNT_NODEV
     with implicitly setting SB_I_NODEV on the filesystem superblock
     instead.

     Using SB_I_NODEV is a stronger form that happens to make this state
     user invisible.  The user visibility can be managed but it caused
     problems when it was introduced from applications reasonably
     expecting mount flags to be what they were set to.

  There is a little bit of work remaining before it is safe to support
  mounting filesystems with backing store in user namespaces, beyond
  what is in this set of changes.

   - Verifying the mounter has permission to read/write the block device
     during mount.

   - Teaching the integrity modules IMA and EVM to handle filesystems
     mounted with only user namespace root and to reduce trust in their
     security xattrs accordingly.

   - Capturing the mounters credentials and using that for permission
     checks in d_automount and the like.  (Given that overlayfs already
     does this, and we need the work in d_automount it make sense to
     generalize this case).

  Furthermore there are a few changes that are on the wishlist:

   - Get all filesystems supporting posix acls using the generic posix
     acls so that posix_acl_fix_xattr_from_user and
     posix_acl_fix_xattr_to_user may be removed.  [Maintainability]

   - Reducing the permission checks in places such as remount to allow
     the superblock owner to perform them.

   - Allowing the superblock owner to chown files with unmapped uids and
     gids to something that is mapped so the files may be treated
     normally.

  I am not considering even obvious relaxations of permission checks
  until it is clear there are no more corner cases that need to be
  locked down and handled generically.

  Many thanks to Seth Forshee who kept this code alive, and putting up
  with me rewriting substantial portions of what he did to handle more
  corner cases, and for his diligent testing and reviewing of my
  changes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (30 commits)
  fs: Call d_automount with the filesystems creds
  fs: Update i_[ug]id_(read|write) to translate relative to s_user_ns
  evm: Translate user/group ids relative to s_user_ns when computing HMAC
  dquot: For now explicitly don't support filesystems outside of init_user_ns
  quota: Handle quota data stored in s_user_ns in quota_setxquota
  quota: Ensure qids map to the filesystem
  vfs: Don't create inodes with a uid or gid unknown to the vfs
  vfs: Don't modify inodes with a uid or gid unknown to the vfs
  cred: Reject inodes with invalid ids in set_create_file_as()
  fs: Check for invalid i_uid in may_follow_link()
  vfs: Verify acls are valid within superblock's s_user_ns.
  userns: Handle -1 in k[ug]id_has_mapping when !CONFIG_USER_NS
  fs: Refuse uid/gid changes which don't map into s_user_ns
  selinux: Add support for unprivileged mounts from user namespaces
  Smack: Handle labels consistently in untrusted mounts
  Smack: Add support for unprivileged mounts from user namespaces
  fs: Treat foreign mounts as nosuid
  fs: Limit file caps to the user namespace of the super block
  userns: Remove the now unnecessary FS_USERNS_DEV_MOUNT flag
  userns: Remove implicit MNT_NODEV fragility.
  ...
2016-07-29 15:54:19 -07:00
Dave Chinner
6c60d2b574 fs/fs-writeback.c: add a new writeback list for sync
wait_sb_inodes() currently does a walk of all inodes in the filesystem
to find dirty one to wait on during sync.  This is highly inefficient
and wastes a lot of CPU when there are lots of clean cached inodes that
we don't need to wait on.

To avoid this "all inode" walk, we need to track inodes that are
currently under writeback that we need to wait for.  We do this by
adding inodes to a writeback list on the sb when the mapping is first
tagged as having pages under writeback.  wait_sb_inodes() can then walk
this list of "inodes under IO" and wait specifically just for the inodes
that the current sync(2) needs to wait for.

Define a couple helpers to add/remove an inode from the writeback list
and call them when the overall mapping is tagged for or cleared from
writeback.  Update wait_sb_inodes() to walk only the inodes under
writeback due to the sync.

With this change, filesystem sync times are significantly reduced for
fs' with largely populated inode caches and otherwise no other work to
do.  For example, on a 16xcpu 2GHz x86-64 server, 10TB XFS filesystem
with a ~10m entry inode cache, sync times are reduced from ~7.3s to less
than 0.1s when the filesystem is fully clean.

Link: http://lkml.kernel.org/r/1466594593-6757-2-git-send-email-bfoster@redhat.com
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: Holger Hoffstätte <holger.hoffstaette@applied-asynchrony.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Eric W. Biederman
cc50a07a24 userns: Remove the now unnecessary FS_USERNS_DEV_MOUNT flag
Now that SB_I_NODEV controls the nodev behavior devpts can just clear
this flag during mount.  Simplifying the code and making it easier
to audit how the code works.  While still preserving the invariant
that s_iflags is only modified during mount.

Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-06-23 15:47:31 -05:00
Eric W. Biederman
67690f937c userns: Remove implicit MNT_NODEV fragility.
Replace the implict setting of MNT_NODEV on mounts that happen with
just user namespace permissions with an implicit setting of SB_I_NODEV
in s_iflags.  The visibility of the implicit MNT_NODEV has caused
problems in the past.

With this change the fragile case where an implicit MNT_NODEV needs to
be preserved in do_remount is removed.  Using SB_I_NODEV is much less
fragile as s_iflags are set during the original mount and never
changed.

In do_new_mount with the implicit setting of MNT_NODEV gone, the only
code that can affect mnt_flags is fs_fully_visible so simplify the if
statement and reduce the indentation of the code to make that clear.

Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-06-23 15:47:23 -05:00
Eric W. Biederman
a001e74cef mnt: Move the FS_USERNS_MOUNT check into sget_userns
Allowing a filesystem to be mounted by other than root in the initial
user namespace is a filesystem property not a mount namespace property
and as such should be checked in filesystem specific code.  Move the
FS_USERNS_MOUNT test into super.c:sget_userns().

Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-06-23 15:41:55 -05:00
Eric W. Biederman
6e4eab577a fs: Add user namespace member to struct super_block
Start marking filesystems with a user namespace owner, s_user_ns.  In
this change this is only used for permission checks of who may mount a
filesystem.  Ultimately s_user_ns will be used for translating ids and
checking capabilities for filesystems mounted from user namespaces.

The default policy for setting s_user_ns is implemented in sget(),
which arranges for s_user_ns to be set to current_user_ns() and to
ensure that the mounter of the filesystem has CAP_SYS_ADMIN in that
user_ns.

The guts of sget are split out into another function sget_userns().
The function sget_userns calls alloc_super with the specified user
namespace or it verifies the existing superblock that was found
has the expected user namespace, and fails with EBUSY when it is not.
This failing prevents users with the wrong privileges mounting a
filesystem.

The reason for the split of sget_userns from sget is that in some
cases such as mount_ns and kernfs_mount_ns a different policy for
permission checking of mounts and setting s_user_ns is necessary, and
the existence of sget_userns() allows those policies to be
implemented.

The helper mount_ns is expected to be used for filesystems such as
proc and mqueuefs which present per namespace information.  The
function mount_ns is modified to call sget_userns instead of sget to
ensure the user namespace owner of the namespace whose information is
presented by the filesystem is used on the superblock.

For sysfs and cgroup the appropriate permission checks are already in
place, and kernfs_mount_ns is modified to call sget_userns so that
the init_user_ns is the only user namespace used.

For the cgroup filesystem cgroup namespace mounts are bind mounts of a
subset of the full cgroup filesystem and as such s_user_ns must be the
same for all of them as there is only a single superblock.

Mounts of sysfs that vary based on the network namespace could in principle
change s_user_ns but it keeps the analysis and implementation of kernfs
simpler if that is not supported, and at present there appear to be no
benefits from supporting a different s_user_ns on any sysfs mount.

Getting the details of setting s_user_ns correct has been
a long process.  Thanks to Pavel Tikhorirorv who spotted a leak
in sget_userns.  Thanks to Seth Forshee who has kept the work alive.

Thanks-to: Seth Forshee <seth.forshee@canonical.com>
Thanks-to: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2016-06-23 15:41:55 -05:00
Eric W. Biederman
d91ee87d8d vfs: Pass data, ns, and ns->userns to mount_ns
Today what is normally called data (the mount options) is not passed
to fill_super through mount_ns.

Pass the mount options and the namespace separately to mount_ns so
that filesystems such as proc that have mount options, can use
mount_ns.

Pass the user namespace to mount_ns so that the standard permission
check that verifies the mounter has permissions over the namespace can
be performed in mount_ns instead of in each filesystems .mount method.
Thus removing the duplication between mqueuefs and proc in terms of
permission checks.  The extra permission check does not currently
affect the rpc_pipefs filesystem and the nfsd filesystem as those
filesystems do not currently allow unprivileged mounts.  Without
unpvileged mounts it is guaranteed that the caller has already passed
capable(CAP_SYS_ADMIN) which guarantees extra permission check will
pass.

Update rpc_pipefs and the nfsd filesystem to ensure that the network
namespace reference is always taken in fill_super and always put in kill_sb
so that the logic is simpler and so that errors originating inside of
fill_super do not cause a network namespace leak.

Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-06-23 15:41:53 -05:00
Jiri Kosina
9938b04472 Merge branch 'master' into for-next
Sync with Linus' tree so that patches against newer codebase can be applied.

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2016-04-18 11:18:55 +02:00
Masanari Iida
bd7ced9881 Doc: treewide : Fix typos in DocBook/filesystem.xml
This patch fix spelling typos found in DocBook/filesystem.xml.
It is because the file was generated from comments in code,
I have to fix the comments in codes, instead of xml file.

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2016-04-18 11:13:05 +02:00
Tejun Heo
a1a0e23e49 writeback: flush inode cgroup wb switches instead of pinning super_block
If cgroup writeback is in use, inodes can be scheduled for
asynchronous wb switching.  Before 5ff8eaac16 ("writeback: keep
superblock pinned during cgroup writeback association switches"), this
could race with umount leading to super_block being destroyed while
inodes are pinned for wb switching.  5ff8eaac16 fixed it by bumping
s_active while wb switches are in flight; however, this allowed
in-flight wb switches to make umounts asynchronous when the userland
expected synchronosity - e.g. fsck immediately following umount may
fail because the device is still busy.

This patch removes the problematic super_block pinning and instead
makes generic_shutdown_super() flush in-flight wb switches.  wb
switches are now executed on a dedicated isw_wq so that they can be
flushed and isw_nr_in_flight keeps track of the number of in-flight wb
switches so that flushing can be avoided in most cases.

v2: Move cgroup_writeback_umount() further below and add MS_ACTIVE
    check in inode_switch_wbs() as Jan an Al suggested.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Tahsin Erdogan <tahsin@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Link: http://lkml.kernel.org/g/CAAeU0aNCq7LGODvVGRU-oU_o-6enii5ey0p1c26D1ZzYwkDc5A@mail.gmail.com
Fixes: 5ff8eaac16 ("writeback: keep superblock pinned during cgroup writeback association switches")
Cc: stable@vger.kernel.org #v4.5
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-03-03 14:42:50 -07:00
Linus Torvalds
7d1fc01afc Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree updates from Jiri Kosina.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
  floppy: make local variable non-static
  exynos: fixes an incorrect header guard
  dt-bindings: fixes some incorrect header guards
  cpufreq-dt: correct dead link in documentation
  cpufreq: ARM big LITTLE: correct dead link in documentation
  treewide: Fix typos in printk
  Documentation: filesystem: Fix typo in fs/eventfd.c
  fs/super.c: use && instead of & for warn_on condition
  Documentation: fix sysfs-ptp
  lib: scatterlist: fix Kconfig description
2016-01-14 17:04:19 -08:00
Dmitry Monakhov
a1c6f05733 fs: use block_device name vsprintf helper
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-01-06 13:03:18 -05:00
Vincent Stehlé
22224a1758 fs/super.c: use && instead of & for warn_on condition
This fixes the following sparse warning:

  fs/super.c:1202:9: warning: dubious: x & !y

Bitwise and logical and are equivalent here, but logical was intended.
The generated code is identical, with and without CONFIG_LOCKDEP.

Signed-off-by: Vincent Stehlé <vincent.stehle@freescale.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2015-12-08 14:50:57 +01:00
Al Viro
061f98e959 Merge branch 'superblock-scaling' of git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next into for-next
Conflicts:
	include/linux/fs.h
2015-08-21 02:31:20 -04:00
Dave Chinner
e97fedb9ef sync: serialise per-superblock sync operations
When competing sync(2) calls walk the same filesystem, they need to
walk the list of inodes on the superblock to find all the inodes
that we need to wait for IO completion on. However, when multiple
wait_sb_inodes() calls do this at the same time, they contend on the
the inode_sb_list_lock and the contention causes system wide
slowdowns. In effect, concurrent sync(2) calls can take longer and
burn more CPU than if they were serialised.

Stop the worst of the contention by adding a per-sb mutex to wrap
around wait_sb_inodes() so that we only execute one sync(2) IO
completion walk per superblock superblock at a time and hence avoid
contention being triggered by concurrent sync(2) calls.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Dave Chinner <dchinner@redhat.com>
2015-08-17 18:39:47 -04:00
Dave Chinner
74278da9f7 inode: convert inode_sb_list_lock to per-sb
The process of reducing contention on per-superblock inode lists
starts with moving the locking to match the per-superblock inode
list. This takes the global lock out of the picture and reduces the
contention problems to within a single filesystem. This doesn't get
rid of contention as the locks still have global CPU scope, but it
does isolate operations on different superblocks form each other.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Dave Chinner <dchinner@redhat.com>
2015-08-17 18:39:46 -04:00
Oleg Nesterov
8129ed2964 change sb_writers to use percpu_rw_semaphore
We can remove everything from struct sb_writers except frozen
and add the array of percpu_rw_semaphore's instead.

This patch doesn't remove sb_writers->wait_unfrozen yet, we keep
it for get_super_thawed(). We will probably remove it later.

This change tries to address the following problems:

	- Firstly, __sb_start_write() looks simply buggy. It does
	  __sb_end_write() if it sees ->frozen, but if it migrates
	  to another CPU before percpu_counter_dec(), sb_wait_write()
	  can wrongly succeed if there is another task which holds
	  the same "semaphore": sb_wait_write() can miss the result
	  of the previous percpu_counter_inc() but see the result
	  of this percpu_counter_dec().

	- As Dave Hansen reports, it is suboptimal. The trivial
	  microbenchmark that writes to a tmpfs file in a loop runs
	  12% faster if we change this code to rely on RCU and kill
	  the memory barriers.

	- This code doesn't look simple. It would be better to rely
	  on the generic locking code.

	  According to Dave, this change adds the same performance
	  improvement.

Note: with this change both freeze_super() and thaw_super() will do
synchronize_sched_expedited() 3 times. This is just ugly. But:

	- This will be "fixed" by the rcu_sync changes we are going
	  to merge. After that freeze_super()->percpu_down_write()
	  will use synchronize_sched(), and thaw_super() won't use
	  synchronize() at all.

	  This doesn't need any changes in fs/super.c.

	- Once we merge rcu_sync changes, we can also change super.c
	  so that all wb_write->rw_sem's will share the single ->rss
	  in struct sb_writes, then freeze_super() will need only one
	  synchronize_sched().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jan Kara <jack@suse.com>
2015-08-15 13:52:13 +02:00
Oleg Nesterov
853b39a7c8 shift percpu_counter_destroy() into destroy_super_work()
Of course, this patch is ugly as hell. It will be (partially)
reverted later. We add it to ensure that other WIP changes in
percpu_rw_semaphore won't break fs/super.c.

We do not even need this change right now, percpu_free_rwsem()
is fine in atomic context. But we are going to change this, it
will be might_sleep() after we merge the rcu_sync() patches.

And even after that we do not really need destroy_super_work(),
we will kill it in any case. Instead, destroy_super_rcu() should
just check that rss->cb_state == CB_IDLE and do call_rcu() again
in the (very unlikely) case this is not true.

So this is just the temporary kludge which helps us to avoid the
conflicts with the changes which will be (hopefully) routed via
rcu tree.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jan Kara <jack@suse.com>
2015-08-15 13:52:11 +02:00
Oleg Nesterov
0e28e01f1e document rwsem_release() in sb_wait_write()
Not only we need to avoid the warning from lockdep_sys_exit(), the
caller of freeze_super() can never release this lock. Another thread
can do this, so there is another reason for rwsem_release().

Plus the comment should explain why we have to fool lockdep.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jan Kara <jack@suse.com>
2015-08-15 13:52:09 +02:00
Oleg Nesterov
f4b554af99 fix the broken lockdep logic in __sb_start_write()
1. wait_event(frozen < level) without rwsem_acquire_read() is just
   wrong from lockdep perspective. If we are going to deadlock
   because the caller is buggy, lockdep can't detect this problem.

2. __sb_start_write() can race with thaw_super() + freeze_super(),
   and after "goto retry" the 2nd  acquire_freeze_lock() is wrong.

3. The "tell lockdep we are doing trylock" hack doesn't look nice.

   I think this is correct, but this logic should be more explicit.
   Yes, the recursive read_lock() is fine if we hold the lock on a
   higher level. But we do not need to fool lockdep. If we can not
   deadlock in this case then try-lock must not fail and we can use
   use wait == F throughout this code.

Note: as Dave Chinner explains, the "trylock" hack and the fat comment
can be probably removed. But this needs a separate change and it will
be trivial: just kill __sb_start_write() and rename do_sb_start_write()
back to __sb_start_write().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jan Kara <jack@suse.com>
2015-08-15 13:52:09 +02:00
Wang YanQing
1af95de6f0 fs:super:get_anon_bdev: fix race condition could cause dev exceed its upper limitation
Execution of get_anon_bdev concurrently and preemptive kernel all
could bring race condition, it isn't enough to check dev against
its upper limitation with equality operator only.

This patch fix it.

Signed-off-by: Wang YanQing <udknight@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-07-01 01:50:06 -04:00
Vladimir Davydov
3cb29d1117 cleancache: remove limit on the number of cleancache enabled filesystems
The limit equals 32 and is imposed by the number of entries in the
fs_poolid_map and shared_fs_poolid_map.  Nowadays it is insufficient,
because with containers on board a Linux host can have hundreds of
active fs mounts.

These maps were introduced by commit 49a9ab815a ("mm: cleancache:
lazy initialization to allow tmem backends to build/run as modules") in
order to allow compiling cleancache drivers as modules.  Real pool ids
are stored in these maps while super_block->cleancache_poolid points to
an entry in the map, so that on cleancache registration we can walk over
all (if there are <= 32 of them, of course) cleancache-enabled super
blocks and assign real pool ids.

Actually, there is absolutely no need in these maps, because we can
iterate over all super blocks immediately using iterate_supers.  This is
not racy, because cleancache_init_ops is called from mount_fs with
super_block->s_umount held for writing, while iterate_supers takes this
semaphore for reading, so if we call iterate_supers after setting
cleancache_ops, all super blocks that had been created before
cleancache_register_ops was called will be assigned pool ids by the
action function of iterate_supers while all newer super blocks will
receive it in cleancache_init_fs.

This patch therefore removes the maps and hence the artificial limit on
the number of cleancache enabled filesystems.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Stefan Hengelein <ilendir@googlemail.com>
Cc: Florian Schmaus <fschmaus@gmail.com>
Cc: Andor Daam <andor.daam@googlemail.com>
Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
Cc: Bob Liu <lliubbo@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:49:03 -07:00
Konstantin Khlebnikov
eb6ef3df4f trylock_super(): replacement for grab_super_passive()
I've noticed significant locking contention in memory reclaimer around
sb_lock inside grab_super_passive(). Grab_super_passive() is called from
two places: in icache/dcache shrinkers (function super_cache_scan) and
from writeback (function __writeback_inodes_wb). Both are required for
progress in memory allocator.

Grab_super_passive() acquires sb_lock to increment sb->s_count and check
sb->s_instances. It seems sb->s_umount locked for read is enough here:
super-block deactivation always runs under sb->s_umount locked for write.
Protecting super-block itself isn't a problem: in super_cache_scan() sb
is protected by shrinker_rwsem: it cannot be freed if its slab shrinkers
are still active. Inside writeback super-block comes from inode from bdi
writeback list under wb->list_lock.

This patch removes locking sb_lock and checks s_instances under s_umount:
generic_shutdown_super() unlinks it under sb->s_umount locked for write.
New variant is called trylock_super() and since it only locks semaphore,
callers must call up_read(&sb->s_umount) instead of drop_super(sb) when
they're done.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-22 11:38:42 -05:00
Linus Torvalds
50652963ea Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc VFS updates from Al Viro:
 "This cycle a lot of stuff sits on topical branches, so I'll be sending
  more or less one pull request per branch.

  This is the first pile; more to follow in a few.  In this one are
  several misc commits from early in the cycle (before I went for
  separate branches), plus the rework of mntput/dput ordering on umount,
  switching to use of fs_pin instead of convoluted games in
  namespace_unlock()"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  switch the IO-triggering parts of umount to fs_pin
  new fs_pin killing logics
  allow attaching fs_pin to a group not associated with some superblock
  get rid of the second argument of acct_kill()
  take count and rcu_head out of fs_pin
  dcache: let the dentry count go down to zero without taking d_lock
  pull bumping refcount into ->kill()
  kill pin_put()
  mode_t whack-a-mole: chelsio
  file->f_path.dentry is pinned down for as long as the file is open...
  get rid of lustre_dump_dentry()
  gut proc_register() a bit
  kill d_validate()
  ncpfs: get rid of d_validate() nonsense
  selinuxfs: don't open-code d_genocide()
2015-02-17 14:56:45 -08:00
Linus Torvalds
818099574b Merge branch 'akpm' (patches from Andrew)
Merge third set of updates from Andrew Morton:

 - the rest of MM

   [ This includes getting rid of the numa hinting bits, in favor of
     just generic protnone logic.  Yay.     - Linus ]

 - core kernel

 - procfs

 - some of lib/ (lots of lib/ material this time)

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (104 commits)
  lib/lcm.c: replace include
  lib/percpu_ida.c: remove redundant includes
  lib/strncpy_from_user.c: replace module.h include
  lib/stmp_device.c: replace module.h include
  lib/sort.c: move include inside #if 0
  lib/show_mem.c: remove redundant include
  lib/radix-tree.c: change to simpler include
  lib/plist.c: remove redundant include
  lib/nlattr.c: remove redundant include
  lib/kobject_uevent.c: remove redundant include
  lib/llist.c: remove redundant include
  lib/md5.c: simplify include
  lib/list_sort.c: rearrange includes
  lib/genalloc.c: remove redundant include
  lib/idr.c: remove redundant include
  lib/halfmd4.c: simplify includes
  lib/dynamic_queue_limits.c: simplify includes
  lib/sort.c: use simpler includes
  lib/interval_tree.c: simplify includes
  hexdump: make it return number of bytes placed in buffer
  ...
2015-02-12 18:54:28 -08:00
Vladimir Davydov
49e7e7ff8d fs: shrinker: always scan at least one object of each type
In super_cache_scan() we divide the number of objects of particular type
by the total number of objects in order to distribute pressure among As a
result, in some corner cases we can get nr_to_scan=0 even if there are
some objects to reclaim, e.g.  dentries=1, inodes=1, fs_objects=1,
nr_to_scan=1/3=0.

This is unacceptable for per memcg kmem accounting, because this means
that some objects may never get reclaimed after memcg death, preventing it
from being freed.

This patch therefore assures that super_cache_scan() will scan at least
one object of each type if any.

[akpm@linux-foundation.org: add comment]
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-12 18:54:09 -08:00
Vladimir Davydov
2acb60a046 fs: make shrinker memcg aware
Now, to make any list_lru-based shrinker memcg aware we should only
initialize its list_lru as memcg aware.  Let's do it for the general FS
shrinker (super_block::s_shrink).

There are other FS-specific shrinkers that use list_lru for storing
objects, such as XFS and GFS2 dquot cache shrinkers, but since they
reclaim objects that are shared among different cgroups, there is no point
making them memcg aware.  It's a big question whether we should account
them to memcg at all.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Greg Thelen <gthelen@google.com>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-12 18:54:09 -08:00
Vladimir Davydov
c0a5b56093 list_lru: organize all list_lrus to list
To make list_lru memcg aware, we need all list_lrus to be kept on a list
protected by a mutex, so that we could sleep while walking over the
list.

Therefore after this change list_lru_destroy may sleep.  Fortunately,
there is only one user that calls it from an atomic context - it's
put_super - and we can easily fix it by calling list_lru_destroy before
put_super in destroy_locked_super - anyway we don't longer need lrus by
that time.

Another point that should be noted is that list_lru_destroy is allowed
to be called on an uninitialized zeroed-out object, in which case it is
a no-op.  Before this patch this was guaranteed by kfree, but now we
need an explicit check there.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Greg Thelen <gthelen@google.com>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-12 18:54:09 -08:00
Vladimir Davydov
4101b62435 fs: consolidate {nr,free}_cached_objects args in shrink_control
We are going to make FS shrinkers memcg-aware.  To achieve that, we will
have to pass the memcg to scan to the nr_cached_objects and
free_cached_objects VFS methods, which currently take only the NUMA node
to scan.  Since the shrink_control structure already holds the node, and
the memcg to scan will be added to it when we introduce memcg-aware
vmscan, let us consolidate the methods' arguments in this structure to
keep things clean.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Suggested-by: Dave Chinner <david@fromorbit.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Greg Thelen <gthelen@google.com>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-12 18:54:08 -08:00
Vladimir Davydov
503c358cf1 list_lru: introduce list_lru_shrink_{count,walk}
Kmem accounting of memcg is unusable now, because it lacks slab shrinker
support.  That means when we hit the limit we will get ENOMEM w/o any
chance to recover.  What we should do then is to call shrink_slab, which
would reclaim old inode/dentry caches from this cgroup.  This is what
this patch set is intended to do.

Basically, it does two things.  First, it introduces the notion of
per-memcg slab shrinker.  A shrinker that wants to reclaim objects per
cgroup should mark itself as SHRINKER_MEMCG_AWARE.  Then it will be
passed the memory cgroup to scan from in shrink_control->memcg.  For
such shrinkers shrink_slab iterates over the whole cgroup subtree under
the target cgroup and calls the shrinker for each kmem-active memory
cgroup.

Secondly, this patch set makes the list_lru structure per-memcg.  It's
done transparently to list_lru users - everything they have to do is to
tell list_lru_init that they want memcg-aware list_lru.  Then the
list_lru will automatically distribute objects among per-memcg lists
basing on which cgroup the object is accounted to.  This way to make FS
shrinkers (icache, dcache) memcg-aware we only need to make them use
memcg-aware list_lru, and this is what this patch set does.

As before, this patch set only enables per-memcg kmem reclaim when the
pressure goes from memory.limit, not from memory.kmem.limit.  Handling
memory.kmem.limit is going to be tricky due to GFP_NOFS allocations, and
it is still unclear whether we will have this knob in the unified
hierarchy.

This patch (of 9):

NUMA aware slab shrinkers use the list_lru structure to distribute
objects coming from different NUMA nodes to different lists.  Whenever
such a shrinker needs to count or scan objects from a particular node,
it issues commands like this:

        count = list_lru_count_node(lru, sc->nid);
        freed = list_lru_walk_node(lru, sc->nid, isolate_func,
                                   isolate_arg, &sc->nr_to_scan);

where sc is an instance of the shrink_control structure passed to it
from vmscan.

To simplify this, let's add special list_lru functions to be used by
shrinkers, list_lru_shrink_count() and list_lru_shrink_walk(), which
consolidate the nid and nr_to_scan arguments in the shrink_control
structure.

This will also allow us to avoid patching shrinkers that use list_lru
when we make shrink_slab() per-memcg - all we will have to do is extend
the shrink_control structure to include the target memcg and make
list_lru_shrink_{count,walk} handle this appropriately.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Suggested-by: Dave Chinner <david@fromorbit.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Greg Thelen <gthelen@google.com>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-12 18:54:08 -08:00
Al Viro
15d0f5ea34 Make super_blocks and sb_lock static
The only user outside of fs/super.c is gone now

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-02-02 10:07:59 -07:00
Al Viro
fdab684d72 allow attaching fs_pin to a group not associated with some superblock
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-01-25 23:17:28 -05:00
Christoph Hellwig
df0ce26cb4 fs: remove default_backing_dev_info
Now that default_backing_dev_info is not used for writeback purposes we can
git rid of it easily:

 - instead of using it's name for tracing unregistered bdi we just use
   "unknown"
 - btrfs and ceph can just assign the default read ahead window themselves
   like several other filesystems already do.
 - we can assign noop_backing_dev_info as the default one in alloc_super.
   All filesystems already either assigned their own or
   noop_backing_dev_info.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-20 14:05:38 -07:00
Linus Torvalds
77c688ac87 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
 "The big thing in this pile is Eric's unmount-on-rmdir series; we
  finally have everything we need for that.  The final piece of prereqs
  is delayed mntput() - now filesystem shutdown always happens on
  shallow stack.

  Other than that, we have several new primitives for iov_iter (Matt
  Wilcox, culled from his XIP-related series) pushing the conversion to
  ->read_iter()/ ->write_iter() a bit more, a bunch of fs/dcache.c
  cleanups and fixes (including the external name refcounting, which
  gives consistent behaviour of d_move() wrt procfs symlinks for long
  and short names alike) and assorted cleanups and fixes all over the
  place.

  This is just the first pile; there's a lot of stuff from various
  people that ought to go in this window.  Starting with
  unionmount/overlayfs mess...  ;-/"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (60 commits)
  fs/file_table.c: Update alloc_file() comment
  vfs: Deduplicate code shared by xattr system calls operating on paths
  reiserfs: remove pointless forward declaration of struct nameidata
  don't need that forward declaration of struct nameidata in dcache.h anymore
  take dname_external() into fs/dcache.c
  let path_init() failures treated the same way as subsequent link_path_walk()
  fix misuses of f_count() in ppp and netlink
  ncpfs: use list_for_each_entry() for d_subdirs walk
  vfs: move getname() from callers to do_mount()
  gfs2_atomic_open(): skip lookups on hashed dentry
  [infiniband] remove pointless assignments
  gadgetfs: saner API for gadgetfs_create_file()
  f_fs: saner API for ffs_sb_create_file()
  jfs: don't hash direct inode
  [s390] remove pointless assignment of ->f_op in vmlogrdr ->open()
  ecryptfs: ->f_op is never NULL
  android: ->f_op is never NULL
  nouveau: __iomem misannotations
  missing annotation in fs/file.c
  fs: namespace: suppress 'may be used uninitialized' warnings
  ...
2014-10-13 11:28:42 +02:00
Tetsuo Handa
475d0db742 fs: Fix theoretical division by 0 in super_cache_scan().
total_objects could be 0 and is used as a denom.

While total_objects is a "long", total_objects == 0 unlikely happens for
3.12 and later kernels because 32-bit architectures would not be able to
hold (1 << 32) objects. However, total_objects == 0 may happen for kernels
between 3.1 and 3.11 because total_objects in prune_super() was an "int"
and (e.g.) x86_64 architecture might be able to hold (1 << 32) objects.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: stable <stable@kernel.org> # 3.1+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-09 02:39:02 -04:00