Commit graph

132 commits

Author SHA1 Message Date
Amir Goldstein
137ec526a2 ovl: create helper ovl_create_temp()
Also used ovl_create_temp() in ovl_create_index() instead of calling
ovl_do_mkdir() directly, so now all callers of ovl_do_mkdir() are routed
through ovl_create_real(), which paves the way for Al's fix for non-hashed
result from vfs_mkdir().

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-05-31 11:06:11 +02:00
Miklos Szeredi
95a1c8153a ovl: return dentry from ovl_create_real()
Al Viro suggested to simplify callers of ovl_create_real() by
returning the created dentry (or ERR_PTR) from ovl_create_real().

Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-05-31 11:06:11 +02:00
Amir Goldstein
471ec5dcf4 ovl: struct cattr cleanups
* Rename to ovl_cattr

* Fold ovl_create_real() hardlink argument into struct ovl_cattr

* Create macro OVL_CATTR() to initialize struct ovl_cattr from mode

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-05-31 11:06:10 +02:00
Amir Goldstein
6cf00764b0 ovl: strip debug argument from ovl_do_ helpers
It did not prove to be useful.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-05-31 11:06:10 +02:00
Amir Goldstein
a8b9e0ceed ovl: remove WARN_ON() real inode attributes mismatch
Overlayfs should cope with online changes to underlying layer
without crashing the kernel, which is what xfstest overlay/019
checks.

This test may sometimes trigger WARN_ON() in ovl_create_or_link()
when linking an overlay inode that has been changed on underlying
layer.

Remove those WARN_ON() to prevent the stress test from failing.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-05-31 11:06:10 +02:00
Amir Goldstein
e7dd0e7134 ovl: whiteout index when union nlink drops to zero
With NFS export feature enabled, when overlay inode nlink drops to
zero, instead of removing the index entry, replace it with a whiteout
index entry.

This is needed for NFS export in order to prevent future open by handle
from opening the lower file directly.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-01-24 11:25:56 +01:00
Amir Goldstein
6d0a8a90a5 ovl: take lower dir inode mutex outside upper sb_writers lock
The functions ovl_lower_positive() and ovl_check_empty_dir() both take
inode mutex on the real lower dir under ovl_want_write() which takes
the upper_mnt sb_writers lock.

While this is not a clear locking order or layering violation, it creates
an undesired lock dependency between two unrelated layers for no good
reason.

This lock dependency materializes to a false(?) positive lockdep warning
when calling rmdir() on a nested overlayfs, where both nested and
underlying overlayfs both use the same fs type as upper layer.

rmdir() on the nested overlayfs creates the lock chain:
  sb_writers of upper_mnt (e.g. tmpfs) in ovl_do_remove()
  ovl_i_mutex_dir_key[] of lower overlay dir in ovl_lower_positive()

rmdir() on the underlying overlayfs creates the lock chain in
reverse order:
  ovl_i_mutex_dir_key[] of lower overlay dir in vfs_rmdir()
  sb_writers of nested upper_mnt (e.g. tmpfs) in ovl_do_remove()

To rid of the unneeded locking dependency, move both ovl_lower_positive()
and ovl_check_empty_dir() to before ovl_want_write() in rmdir() and
rename() implementation.

This change spreads the pieces of ovl_check_empty_and_clear() directly
inside the rmdir()/rename() implementations so the helper is no longer
needed and removed.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-01-19 17:43:23 +01:00
Amir Goldstein
da2e6b7eed ovl: fix overlay: warning prefix
Conform two stray warning messages to the standard overlayfs: prefix.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-12-14 11:14:52 +01:00
Amir Goldstein
f30536f0f9 ovl: update cache version of impure parent on rename
ovl_rename() updates dir cache version for impure old parent if an entry
with copy up origin is moved into old parent, but it did not update
cache version if the entry moved out of old parent has a copy up origin.

[SzM] Same for new dir: we updated the version if an entry with origin was
moved in, but not if an entry with origin was moved out.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-11-09 10:23:27 +01:00
zhangyi (F)
07f6fff148 ovl: fix rmdir problem on non-merge dir with origin xattr
An "origin && non-merge" upper dir may have leftover whiteouts that
were created in past mount. overlayfs does no clear this dir when we
delete it, which may lead to rmdir fail or temp file left in workdir.

Simple reproducer:
  mkdir lower upper work merge
  mkdir -p lower/dir
  touch lower/dir/a
  mount -t overlay overlay -olowerdir=lower,upperdir=upper,\
    workdir=work merge
  rm merge/dir/a
  umount merge
  rm -rf lower/*
  touch lower/dir  (*)
  mount -t overlay overlay -olowerdir=lower,upperdir=upper,\
    workdir=work merge
  rm -rf merge/dir

Syslog dump:
  overlayfs: cleanup of 'work/#7' failed (-39)

(*): if we do not create the regular file, the result is different:
  rm: cannot remove "dir/": Directory not empty

This patch adds a check for the case of non-merge dir that may contain
whiteouts, and calls ovl_check_empty_dir() to check and clear whiteouts
from upper dir when an empty dir is being deleted.

[amir: split patch from ovl_check_empty_dir() cleanup
       rename ovl_is_origin() to ovl_may_have_whiteouts()
       check OVL_WHITEOUTS flag instead of checking origin xattr]

Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-11-09 10:23:27 +01:00
zhangyi (F)
95e598e7ac ovl: simplify ovl_check_empty_and_clear()
Filter out non-whiteout non-upper entries from list of merge dir entries
while checking if merge dir is empty in ovl_check_empty_dir().
The remaining work for ovl_clear_empty() is to clear all entries on the
list.

[amir: split patch from rmdir bug fix]

Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-11-09 10:23:27 +01:00
Amir Goldstein
5820dc0888 ovl: fix missing unlock_rename() in ovl_do_copy_up()
Use the ovl_lock_rename_workdir() helper which requires
unlock_rename() only on lock success.

Fixes: ("fd210b7d67ee ovl: move copy up lock out")
Cc: <stable@vger.kernel.org> # v4.13
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-10-05 15:53:18 +02:00
Michal Hocko
0ee931c4e3 mm: treewide: remove GFP_TEMPORARY allocation flag
GFP_TEMPORARY was introduced by commit e12ba74d8f ("Group short-lived
and reclaimable kernel allocations") along with __GFP_RECLAIMABLE.  It's
primary motivation was to allow users to tell that an allocation is
short lived and so the allocator can try to place such allocations close
together and prevent long term fragmentation.  As much as this sounds
like a reasonable semantic it becomes much less clear when to use the
highlevel GFP_TEMPORARY allocation flag.  How long is temporary? Can the
context holding that memory sleep? Can it take locks? It seems there is
no good answer for those questions.

The current implementation of GFP_TEMPORARY is basically GFP_KERNEL |
__GFP_RECLAIMABLE which in itself is tricky because basically none of
the existing caller provide a way to reclaim the allocated memory.  So
this is rather misleading and hard to evaluate for any benefits.

I have checked some random users and none of them has added the flag
with a specific justification.  I suspect most of them just copied from
other existing users and others just thought it might be a good idea to
use without any measuring.  This suggests that GFP_TEMPORARY just
motivates for cargo cult usage without any reasoning.

I believe that our gfp flags are quite complex already and especially
those with highlevel semantic should be clearly defined to prevent from
confusion and abuse.  Therefore I propose dropping GFP_TEMPORARY and
replace all existing users to simply use GFP_KERNEL.  Please note that
SLAB users with shrinkers will still get __GFP_RECLAIMABLE heuristic and
so they will be placed properly for memory fragmentation prevention.

I can see reasons we might want some gfp flag to reflect shorterm
allocations but I propose starting from a clear semantic definition and
only then add users with proper justification.

This was been brought up before LSF this year by Matthew [1] and it
turned out that GFP_TEMPORARY really doesn't have a clear semantic.  It
seems to be a heuristic without any measured advantage for most (if not
all) its current users.  The follow up discussion has revealed that
opinions on what might be temporary allocation differ a lot between
developers.  So rather than trying to tweak existing users into a
semantic which they haven't expected I propose to simply remove the flag
and start from scratch if we really need a semantic for short term
allocations.

[1] http://lkml.kernel.org/r/20170118054945.GD18349@bombadil.infradead.org

[akpm@linux-foundation.org: fix typo]
[akpm@linux-foundation.org: coding-style fixes]
[sfr@canb.auug.org.au: drm/i915: fix up]
  Link: http://lkml.kernel.org/r/20170816144703.378d4f4d@canb.auug.org.au
Link: http://lkml.kernel.org/r/20170728091904.14627-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Neil Brown <neilb@suse.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-13 18:53:16 -07:00
Miklos Szeredi
4edb83bb10 ovl: constant d_ino for non-merge dirs
Impure directories are ones which contain objects with origins (i.e. those
that have been copied up).  These are relevant to readdir operation only
because of the d_ino field, no other transformation is necessary.  Also a
directory can become impure between two getdents(2) calls.

This patch creates a cache for impure directories.  Unlike the cache for
merged directories, this one only contains entries with origin and is not
refcounted but has a its lifetime tied to that of the dentry.

Similarly to the merged cache, the impure cache is invalidated based on a
version number.  This version number is incremented when an entry with
origin is added or removed from the directory.

If the cache is empty, then the impure xattr is removed from the directory.

This patch also fixes up handling of d_ino for the ".." entry if the parent
directory is merged.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-07-27 21:54:06 +02:00
Amir Goldstein
ea3dad18dc ovl: mark parent impure on ovl_link()
When linking a file with copy up origin into a new parent, mark the
new parent dir "impure".

Fixes: ee1d6d37b6 ("ovl: mark upper dir with type origin entries "impure"")
Cc: <stable@vger.kernel.org> # v4.12
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-07-13 22:06:45 +02:00
Amir Goldstein
5f8415d6b8 ovl: persistent overlay inode nlink for indexed inodes
With inodes index enabled, an overlay inode nlink counts the union of upper
and non-covered lower hardlinks. During the lifetime of a non-pure upper
inode, the following nlink modifying operations can happen:

1. Lower hardlink copy up
2. Upper hardlink created, unlinked or renamed over
3. Lower hardlink whiteout or renamed over

For the first, copy up case, the union nlink does not change, whether the
operation succeeds or fails, but the upper inode nlink may change.
Therefore, before copy up, we store the union nlink value relative to the
lower inode nlink in the index inode xattr trusted.overlay.nlink.

For the second, upper hardlink case, the union nlink should be incremented
or decremented IFF the operation succeeds, aligned with nlink change of the
upper inode. Therefore, before link/unlink/rename, we store the union nlink
value relative to the upper inode nlink in the index inode.

For the last, lower cover up case, we simplify things by preceding the
whiteout or cover up with copy up. This makes sure that there is an index
upper inode where the nlink xattr can be stored before the copied up upper
entry is unlink.

Return the overlay inode nlinks for indexed upper inodes on stat(2).

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-07-04 22:03:19 +02:00
Miklos Szeredi
55acc66182 ovl: add flag for upper in ovl_entry
For rename, we need to ensure that an upper alias exists for hard links
before attempting the operation.  Introduce a flag in ovl_entry to track
the state of the upper alias.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-07-04 22:03:18 +02:00
Amir Goldstein
415543d5c6 ovl: cleanup bad and stale index entries on mount
Bad index entries are entries whose name does not match the
origin file handle stored in trusted.overlay.origin xattr.
Bad index entries could be a result of a system power off in
the middle of copy up.

Stale index entries are entries whose origin file handle is
stale. Stale index entries could be a result of copying layers
or removing lower entries while the overlay is not mounted.
The case of copying layers should be detected earlier by the
verification of upper root dir origin and index dir origin.

Both bad and stale index entries are detected and removed
on mount.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-07-04 22:03:17 +02:00
Miklos Szeredi
09d8b58673 ovl: move __upperdentry to ovl_inode
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-07-04 22:03:16 +02:00
Miklos Szeredi
9020df3720 ovl: compare inodes
When checking for consistency in directory operations (unlink, rename,
etc.) match inodes not dentries.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-07-04 22:03:16 +02:00
Amir Goldstein
f681eb1d5c ovl: fix nlink leak in ovl_rename()
This patch fixes an overlay inode nlink leak in the case where
ovl_rename() renames over a non-dir.

This is not so critical, because overlay inode doesn't rely on
nlink dropping to zero for inode deletion.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-07-04 22:03:15 +02:00
Amir Goldstein
f3a1568582 ovl: mark upper merge dir with type origin entries "impure"
An upper dir is marked "impure" to let ovl_iterate() know that this
directory may contain non pure upper entries whose d_ino may need to be
read from the origin inode.

We already mark a non-merge dir "impure" when moving a non-pure child
entry inside it, to let ovl_iterate() know not to iterate the non-merge
dir directly.

Mark also a merge dir "impure" when moving a non-pure child entry inside
it and when copying up a child entry inside it.

This can be used to optimize ovl_iterate() to perform a "pure merge" of
upper and lower directories, merging the content of the directories,
without having to read d_ino from origin inodes.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-05-29 11:48:00 +02:00
Amir Goldstein
ee1d6d37b6 ovl: mark upper dir with type origin entries "impure"
When moving a merge dir or non-dir with copy up origin into a non-merge
upper dir (a.k.a pure upper dir), we are marking the target parent dir
"impure". ovl_iterate() iterates pure upper dirs directly, because there is
no need to filter out whiteouts and merge dir content with lower dir. But
for the case of an "impure" upper dir, ovl_iterate() will not be able to
iterate the real upper dir directly, because it will need to lookup the
origin inode and use it to fill d_ino.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-05-19 09:33:49 +02:00
Miklos Szeredi
3d27573ce3 ovl: remove unused arg from ovl_lookup_temp()
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-05-19 09:33:49 +02:00
Amir Goldstein
21a2287811 ovl: handle rename when upper doesn't support xattr
On failure to set opaque/redirect xattr on rename, skip setting xattr and
return -EXDEV.

On failure to set opaque xattr when creating a new directory, -EIO is
returned instead of -EOPNOTSUPP.

Any failure to set those xattr will be recorded in super block and
then setting any xattr on upper won't be attempted again.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-05-19 09:33:49 +02:00
Amir Goldstein
5b6c9053fb ovl: persistent inode numbers for upper hardlinks
An upper type non directory dentry that is a copy up target
should have a reference to its lower copy up origin.

There are three ways for an upper type dentry to be instantiated:
1. A lower type dentry that is being copied up
2. An entry that is found in upper dir by ovl_lookup()
3. A negative dentry is hardlinked to an upper type dentry

In the first case, the lower reference is set before copy up.
In the second case, the lower reference is found by ovl_lookup().
In the last case of hardlinked upper dentry, it is not easy to
update the lower reference of the negative dentry.  Instead,
drop the newly hardlinked negative dentry from dcache and let
the next access call ovl_lookup() to find its lower reference.

This makes sure that the inode number reported by stat(2) after
the hardlink is created is the same inode number that will be
reported by stat(2) after mount cycle, which is the inode number
of the lower copy up origin of the hardlink source.

NOTE that this does not fix breaking of lower hardlinks on copy
up, but only fixes the case of lower nlink == 1, whose upper copy
up inode is hardlinked in upper dir.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-05-05 11:38:58 +02:00
Miklos Szeredi
5b712091a3 ovl: merge getattr for dir and nondir
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-05-05 11:38:58 +02:00
Amir Goldstein
b7a807dc20 ovl: persistent inode number for directories
stat(2) on overlay directories reports the overlay temp inode
number, which is constant across copy up, but is not persistent.

When all layers are on the same fs, report the copy up origin inode
number for directories.

This inode number is persistent, unique across the overlay mount and
constant across copy up.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-05-05 11:38:58 +02:00
Amir Goldstein
4a99f3c83d ovl: do not set overlay.opaque on non-dir create
The optimization for opaque dir create was wrongly being applied
also to non-dir create.

Fixes: 97c684cc91 ("ovl: create directories inside merged parent opaque")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org> # v4.10
2017-04-26 14:33:44 +02:00
David Howells
a528d35e8b statx: Add a system call to make enhanced file info available
Add a system call to make extended file information available, including
file creation and some attribute flags where available through the
underlying filesystem.

The getattr inode operation is altered to take two additional arguments: a
u32 request_mask and an unsigned int flags that indicate the
synchronisation mode.  This change is propagated to the vfs_getattr*()
function.

Functions like vfs_stat() are now inline wrappers around new functions
vfs_statx() and vfs_statx_fd() to reduce stack usage.

========
OVERVIEW
========

The idea was initially proposed as a set of xattrs that could be retrieved
with getxattr(), but the general preference proved to be for a new syscall
with an extended stat structure.

A number of requests were gathered for features to be included.  The
following have been included:

 (1) Make the fields a consistent size on all arches and make them large.

 (2) Spare space, request flags and information flags are provided for
     future expansion.

 (3) Better support for the y2038 problem [Arnd Bergmann] (tv_sec is an
     __s64).

 (4) Creation time: The SMB protocol carries the creation time, which could
     be exported by Samba, which will in turn help CIFS make use of
     FS-Cache as that can be used for coherency data (stx_btime).

     This is also specified in NFSv4 as a recommended attribute and could
     be exported by NFSD [Steve French].

 (5) Lightweight stat: Ask for just those details of interest, and allow a
     netfs (such as NFS) to approximate anything not of interest, possibly
     without going to the server [Trond Myklebust, Ulrich Drepper, Andreas
     Dilger] (AT_STATX_DONT_SYNC).

 (6) Heavyweight stat: Force a netfs to go to the server, even if it thinks
     its cached attributes are up to date [Trond Myklebust]
     (AT_STATX_FORCE_SYNC).

And the following have been left out for future extension:

 (7) Data version number: Could be used by userspace NFS servers [Aneesh
     Kumar].

     Can also be used to modify fill_post_wcc() in NFSD which retrieves
     i_version directly, but has just called vfs_getattr().  It could get
     it from the kstat struct if it used vfs_xgetattr() instead.

     (There's disagreement on the exact semantics of a single field, since
     not all filesystems do this the same way).

 (8) BSD stat compatibility: Including more fields from the BSD stat such
     as creation time (st_btime) and inode generation number (st_gen)
     [Jeremy Allison, Bernd Schubert].

 (9) Inode generation number: Useful for FUSE and userspace NFS servers
     [Bernd Schubert].

     (This was asked for but later deemed unnecessary with the
     open-by-handle capability available and caused disagreement as to
     whether it's a security hole or not).

(10) Extra coherency data may be useful in making backups [Andreas Dilger].

     (No particular data were offered, but things like last backup
     timestamp, the data version number and the DOS archive bit would come
     into this category).

(11) Allow the filesystem to indicate what it can/cannot provide: A
     filesystem can now say it doesn't support a standard stat feature if
     that isn't available, so if, for instance, inode numbers or UIDs don't
     exist or are fabricated locally...

     (This requires a separate system call - I have an fsinfo() call idea
     for this).

(12) Store a 16-byte volume ID in the superblock that can be returned in
     struct xstat [Steve French].

     (Deferred to fsinfo).

(13) Include granularity fields in the time data to indicate the
     granularity of each of the times (NFSv4 time_delta) [Steve French].

     (Deferred to fsinfo).

(14) FS_IOC_GETFLAGS value.  These could be translated to BSD's st_flags.
     Note that the Linux IOC flags are a mess and filesystems such as Ext4
     define flags that aren't in linux/fs.h, so translation in the kernel
     may be a necessity (or, possibly, we provide the filesystem type too).

     (Some attributes are made available in stx_attributes, but the general
     feeling was that the IOC flags were to ext[234]-specific and shouldn't
     be exposed through statx this way).

(15) Mask of features available on file (eg: ACLs, seclabel) [Brad Boyer,
     Michael Kerrisk].

     (Deferred, probably to fsinfo.  Finding out if there's an ACL or
     seclabal might require extra filesystem operations).

(16) Femtosecond-resolution timestamps [Dave Chinner].

     (A __reserved field has been left in the statx_timestamp struct for
     this - if there proves to be a need).

(17) A set multiple attributes syscall to go with this.

===============
NEW SYSTEM CALL
===============

The new system call is:

	int ret = statx(int dfd,
			const char *filename,
			unsigned int flags,
			unsigned int mask,
			struct statx *buffer);

The dfd, filename and flags parameters indicate the file to query, in a
similar way to fstatat().  There is no equivalent of lstat() as that can be
emulated with statx() by passing AT_SYMLINK_NOFOLLOW in flags.  There is
also no equivalent of fstat() as that can be emulated by passing a NULL
filename to statx() with the fd of interest in dfd.

Whether or not statx() synchronises the attributes with the backing store
can be controlled by OR'ing a value into the flags argument (this typically
only affects network filesystems):

 (1) AT_STATX_SYNC_AS_STAT tells statx() to behave as stat() does in this
     respect.

 (2) AT_STATX_FORCE_SYNC will require a network filesystem to synchronise
     its attributes with the server - which might require data writeback to
     occur to get the timestamps correct.

 (3) AT_STATX_DONT_SYNC will suppress synchronisation with the server in a
     network filesystem.  The resulting values should be considered
     approximate.

mask is a bitmask indicating the fields in struct statx that are of
interest to the caller.  The user should set this to STATX_BASIC_STATS to
get the basic set returned by stat().  It should be noted that asking for
more information may entail extra I/O operations.

buffer points to the destination for the data.  This must be 256 bytes in
size.

======================
MAIN ATTRIBUTES RECORD
======================

The following structures are defined in which to return the main attribute
set:

	struct statx_timestamp {
		__s64	tv_sec;
		__s32	tv_nsec;
		__s32	__reserved;
	};

	struct statx {
		__u32	stx_mask;
		__u32	stx_blksize;
		__u64	stx_attributes;
		__u32	stx_nlink;
		__u32	stx_uid;
		__u32	stx_gid;
		__u16	stx_mode;
		__u16	__spare0[1];
		__u64	stx_ino;
		__u64	stx_size;
		__u64	stx_blocks;
		__u64	__spare1[1];
		struct statx_timestamp	stx_atime;
		struct statx_timestamp	stx_btime;
		struct statx_timestamp	stx_ctime;
		struct statx_timestamp	stx_mtime;
		__u32	stx_rdev_major;
		__u32	stx_rdev_minor;
		__u32	stx_dev_major;
		__u32	stx_dev_minor;
		__u64	__spare2[14];
	};

The defined bits in request_mask and stx_mask are:

	STATX_TYPE		Want/got stx_mode & S_IFMT
	STATX_MODE		Want/got stx_mode & ~S_IFMT
	STATX_NLINK		Want/got stx_nlink
	STATX_UID		Want/got stx_uid
	STATX_GID		Want/got stx_gid
	STATX_ATIME		Want/got stx_atime{,_ns}
	STATX_MTIME		Want/got stx_mtime{,_ns}
	STATX_CTIME		Want/got stx_ctime{,_ns}
	STATX_INO		Want/got stx_ino
	STATX_SIZE		Want/got stx_size
	STATX_BLOCKS		Want/got stx_blocks
	STATX_BASIC_STATS	[The stuff in the normal stat struct]
	STATX_BTIME		Want/got stx_btime{,_ns}
	STATX_ALL		[All currently available stuff]

stx_btime is the file creation time, stx_mask is a bitmask indicating the
data provided and __spares*[] are where as-yet undefined fields can be
placed.

Time fields are structures with separate seconds and nanoseconds fields
plus a reserved field in case we want to add even finer resolution.  Note
that times will be negative if before 1970; in such a case, the nanosecond
fields will also be negative if not zero.

The bits defined in the stx_attributes field convey information about a
file, how it is accessed, where it is and what it does.  The following
attributes map to FS_*_FL flags and are the same numerical value:

	STATX_ATTR_COMPRESSED		File is compressed by the fs
	STATX_ATTR_IMMUTABLE		File is marked immutable
	STATX_ATTR_APPEND		File is append-only
	STATX_ATTR_NODUMP		File is not to be dumped
	STATX_ATTR_ENCRYPTED		File requires key to decrypt in fs

Within the kernel, the supported flags are listed by:

	KSTAT_ATTR_FS_IOC_FLAGS

[Are any other IOC flags of sufficient general interest to be exposed
through this interface?]

New flags include:

	STATX_ATTR_AUTOMOUNT		Object is an automount trigger

These are for the use of GUI tools that might want to mark files specially,
depending on what they are.

Fields in struct statx come in a number of classes:

 (0) stx_dev_*, stx_blksize.

     These are local system information and are always available.

 (1) stx_mode, stx_nlinks, stx_uid, stx_gid, stx_[amc]time, stx_ino,
     stx_size, stx_blocks.

     These will be returned whether the caller asks for them or not.  The
     corresponding bits in stx_mask will be set to indicate whether they
     actually have valid values.

     If the caller didn't ask for them, then they may be approximated.  For
     example, NFS won't waste any time updating them from the server,
     unless as a byproduct of updating something requested.

     If the values don't actually exist for the underlying object (such as
     UID or GID on a DOS file), then the bit won't be set in the stx_mask,
     even if the caller asked for the value.  In such a case, the returned
     value will be a fabrication.

     Note that there are instances where the type might not be valid, for
     instance Windows reparse points.

 (2) stx_rdev_*.

     This will be set only if stx_mode indicates we're looking at a
     blockdev or a chardev, otherwise will be 0.

 (3) stx_btime.

     Similar to (1), except this will be set to 0 if it doesn't exist.

=======
TESTING
=======

The following test program can be used to test the statx system call:

	samples/statx/test-statx.c

Just compile and run, passing it paths to the files you want to examine.
The file is built automatically if CONFIG_SAMPLES is enabled.

Here's some example output.  Firstly, an NFS directory that crosses to
another FSID.  Note that the AUTOMOUNT attribute is set because transiting
this directory will cause d_automount to be invoked by the VFS.

	[root@andromeda ~]# /tmp/test-statx -A /warthog/data
	statx(/warthog/data) = 0
	results=7ff
	  Size: 4096            Blocks: 8          IO Block: 1048576  directory
	Device: 00:26           Inode: 1703937     Links: 125
	Access: (3777/drwxrwxrwx)  Uid:     0   Gid:  4041
	Access: 2016-11-24 09:02:12.219699527+0000
	Modify: 2016-11-17 10:44:36.225653653+0000
	Change: 2016-11-17 10:44:36.225653653+0000
	Attributes: 0000000000001000 (-------- -------- -------- -------- -------- -------- ---m---- --------)

Secondly, the result of automounting on that directory.

	[root@andromeda ~]# /tmp/test-statx /warthog/data
	statx(/warthog/data) = 0
	results=7ff
	  Size: 4096            Blocks: 8          IO Block: 1048576  directory
	Device: 00:27           Inode: 2           Links: 125
	Access: (3777/drwxrwxrwx)  Uid:     0   Gid:  4041
	Access: 2016-11-24 09:02:12.219699527+0000
	Modify: 2016-11-17 10:44:36.225653653+0000
	Change: 2016-11-17 10:44:36.225653653+0000

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-03-02 20:51:15 -05:00
Al Viro
32a3d848eb ovl: clean up kstat usage
FWIW, there's a bit of abuse of struct kstat in overlayfs object
creation paths - for one thing, it ends up with a very small subset
of struct kstat (mode + rdev), for another it also needs link in
case of symlinks and ends up passing it separately.

IMO it would be better to introduce a separate object for that.

In principle, we might even lift that thing into general API and switch
 ->mkdir()/->mknod()/->symlink() to identical calling conventions.  Hell
knows, perhaps ->create() as well...

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-12-16 11:02:57 +01:00
Amir Goldstein
97c684cc91 ovl: create directories inside merged parent opaque
The benefit of making directories opaque on creation is that lookups can
stop short when they reach the original created directory, instead of
continue lookup the entire depth of parent directory stack.

The best case is overlay with N layers, performing lookup for first level
directory, which exists only in upper.  In that case, there will be only
one lookup instead of N.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-12-16 11:02:57 +01:00
Miklos Szeredi
5cf5b477f0 ovl: opaque cleanup
oe->opaque is set for

 a) whiteouts
 b) directories having the "trusted.overlay.opaque" xattr

Case b can be simplified, since setting the xattr always implies setting
oe->opaque.  Also once set, the opaque flag is never cleared.

Don't need to set opaque flag for non-directories.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-12-16 11:02:57 +01:00
Miklos Szeredi
3ea22a71b6 ovl: allow setting max size of redirect
Add a module option to allow tuning the max size of absolute redirects.
Default is 256.

Size of relative redirects is naturally limited by the the underlying
filesystem's max filename length (usually 255).

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-12-16 11:02:57 +01:00
Amir Goldstein
d15951198e ovl: check for emptiness of redirect dir
Before introducing redirect_dir feature, the condition
!ovl_lower_positive(dentry) for a directory, implied that it is a pure
upper directory, which may be removed if empty.

Now that directory can be redirect, it is possible that upper does not
cover any lower (i.e. !ovl_lower_positive(dentry)), but the directory is a
merge (with redirected path) and maybe non empty.

Check for this case in ovl_remove_upper().

This change fixes the following test case from rename-pop-dir.py
of unionmount-testsuite:

    """Remove dir and rename old name"""
    d = ctx.non_empty_dir()
    d2 = ctx.no_dir()

    ctx.rmdir(d, err=ENOTEMPTY)
    ctx.rename(d, d2)
    ctx.rmdir(d, err=ENOENT)
    ctx.rmdir(d2, err=ENOTEMPTY)

./run --ov rename-pop-dir
/mnt/a/no_dir103: Expected error (Directory not empty) was not produced

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-12-16 11:02:57 +01:00
Miklos Szeredi
a6c6065511 ovl: redirect on rename-dir
Current code returns EXDEV when a directory would need to be copied up to
move.  We could copy up the directory tree in this case, but there's
another, simpler solution: point to old lower directory from moved upper
directory.

This is achieved with a "trusted.overlay.redirect" xattr storing the path
relative to the root of the overlay.  After such attribute has been set,
the directory can be moved without further actions required.

This is a backward incompatible feature, old kernels won't be able to
correctly mount an overlay containing redirected directories.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-12-16 11:02:56 +01:00
Miklos Szeredi
3ee23ff102 ovl: check lower existence of rename target
Check if something exists on the lower layer(s) under the target or rename
to decide if directory needs to be marked "opaque".

Marking opaque is done before the rename, and on failure the marking was
undone.  Also the opaque xattr was removed if the target didn't cover
anything.

This patch changes behavior so that removal of "opaque" is not done in
either of the above cases.  This means that directory may have the opaque
flag even if it doesn't cover anything.  However this shouldn't affect the
performance or semantics of the overalay, while simplifying the code.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-12-16 11:02:55 +01:00
Miklos Szeredi
370e55ace5 ovl: rename: simplify handling of lower/merged directory
d_is_dir() is safe to call on a negative dentry.  Use this fact to simplify
handling of the lower or merged directories.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-12-16 11:02:55 +01:00
Miklos Szeredi
38e813db61 ovl: get rid of PURE type
The remainging uses of __OVL_PATH_PURE can be replaced by
ovl_dentry_is_opaque().

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-12-16 11:02:55 +01:00
Miklos Szeredi
2aff4534b6 ovl: check lower existence when removing
Currently ovl_lookup() checks existence of lower file even if there's a
non-directory on upper (which is always opaque).  This is done so that
remove can decide whether a whiteout is needed or not.

It would be better to defer this check to unlink, since most of the time
the gathered information about opaqueness will be unused.

This adds a helper ovl_lower_positive() that checks if there's anything on
the lower layer(s).

The following patches also introduce changes to how the "opaque" attribute
is updated on directories: this attribute is added when the directory is
creted or moved over a whiteout or object covering something on the lower
layer.  However following changes will allow the attribute to remain on the
directory after being moved, even if the new location doesn't cover
anything.  Because of this, we need to check lower layers even for opaque
directories, so that whiteout is only created when necessary.

This function will later be also used to decide about marking a directory
opaque, so deal with negative dentries as well.  When dealing with
negative, it's enough to check for being a whiteout

If the dentry is positive but not upper then it also obviously needs
whiteout/opaque.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-12-16 11:02:55 +01:00
Miklos Szeredi
c412ce4983 ovl: add ovl_dentry_is_whiteout()
And use it instead of ovl_dentry_is_opaque() where appropriate.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-12-16 11:02:55 +01:00
Miklos Szeredi
99f5d08e36 ovl: don't check sticky
Since commit 07a2daab49 ("ovl: Copy up underlying inode's ->i_mode to
overlay inode") sticky checking on overlay inode is performed by the vfs,
so checking against sticky on underlying inode is not needed.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-12-16 11:02:55 +01:00
Miklos Szeredi
804032fabb ovl: don't check rename to self
This is redundant, the vfs already performed this check (and was broken,
see commit 9409e22acd ("vfs: rename: check backing inode being equal")).

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-12-16 11:02:55 +01:00
Miklos Szeredi
ca4c8a3a80 ovl: treat special files like a regular fs
No sense in opening special files on the underlying layers, they work just
as well if opened on the overlay.

Side effect is that it's no longer possible to connect one side of a pipe
opened on overlayfs with the other side opened on the underlying layer.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-12-16 11:02:55 +01:00
Miklos Szeredi
6c02cb59e6 ovl: rename ovl_rename2() to ovl_rename()
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-12-16 11:02:55 +01:00
Linus Torvalds
1a892b485f Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs updates from Miklos Szeredi:
 "This update contains fixes to the "use mounter's permission to access
  underlying layers" area, and miscellaneous other fixes and cleanups.

  No new features this time"

* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: use vfs_get_link()
  vfs: add vfs_get_link() helper
  ovl: use generic_readlink
  ovl: explain error values when removing acl from workdir
  ovl: Fix info leak in ovl_lookup_temp()
  ovl: during copy up, switch to mounter's creds early
  ovl: lookup: do getxattr with mounter's permission
  ovl: copy_up_xattr(): use strnlen
2016-10-14 17:23:33 -07:00
Linus Torvalds
101105b171 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull more vfs updates from Al Viro:
 ">rename2() work from Miklos + current_time() from Deepa"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: Replace current_fs_time() with current_time()
  fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps
  fs: Replace CURRENT_TIME with current_time() for inode timestamps
  fs: proc: Delete inode time initializations in proc_alloc_inode()
  vfs: Add current_time() api
  vfs: add note about i_op->rename changes to porting
  fs: rename "rename2" i_op to "rename"
  vfs: remove unused i_op->rename
  fs: make remaining filesystems use .rename2
  libfs: support RENAME_NOREPLACE in simple_rename()
  fs: support RENAME_NOREPLACE for local filesystems
  ncpfs: fix unused variable warning
2016-10-10 20:16:43 -07:00
Linus Torvalds
97d2116708 Merge branch 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs xattr updates from Al Viro:
 "xattr stuff from Andreas

  This completes the switch to xattr_handler ->get()/->set() from
  ->getxattr/->setxattr/->removexattr"

* 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  vfs: Remove {get,set,remove}xattr inode operations
  xattr: Stop calling {get,set,remove}xattr inode operations
  vfs: Check for the IOP_XATTR flag in listxattr
  xattr: Add __vfs_{get,set,remove}xattr helpers
  libfs: Use IOP_XATTR flag for empty directory handling
  vfs: Use IOP_XATTR flag for bad-inode handling
  vfs: Add IOP_XATTR inode operations flag
  vfs: Move xattr_resolve_name to the front of fs/xattr.c
  ecryptfs: Switch to generic xattr handlers
  sockfs: Get rid of getxattr iop
  sockfs: getxattr: Fail with -EOPNOTSUPP for invalid attribute names
  kernfs: Switch to generic xattr handlers
  hfs: Switch to generic xattr handlers
  jffs2: Remove jffs2_{get,set,remove}xattr macros
  xattr: Remove unnecessary NULL attribute name check
2016-10-10 17:11:50 -07:00
Andreas Gruenbacher
fd50ecaddf vfs: Remove {get,set,remove}xattr inode operations
These inode operations are no longer used; remove them.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-07 21:48:36 -04:00
Linus Torvalds
a3443cda55 Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull security subsystem updates from James Morris:

  SELinux/LSM:
   - overlayfs support, necessary for container filesystems

  LSM:
   - finally remove the kernel_module_from_file hook

  Smack:
   - treat signal delivery as an 'append' operation

  TPM:
   - lots of bugfixes & updates

  Audit:
   - new audit data type: LSM_AUDIT_DATA_FILE

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (47 commits)
  Revert "tpm/tpm_crb: implement tpm crb idle state"
  Revert "tmp/tpm_crb: fix Intel PTT hw bug during idle state"
  Revert "tpm/tpm_crb: open code the crb_init into acpi_add"
  Revert "tmp/tpm_crb: implement runtime pm for tpm_crb"
  lsm,audit,selinux: Introduce a new audit data type LSM_AUDIT_DATA_FILE
  tmp/tpm_crb: implement runtime pm for tpm_crb
  tpm/tpm_crb: open code the crb_init into acpi_add
  tmp/tpm_crb: fix Intel PTT hw bug during idle state
  tpm/tpm_crb: implement tpm crb idle state
  tpm: add check for minimum buffer size in tpm_transmit()
  tpm: constify TPM 1.x header structures
  tpm/tpm_crb: fix the over 80 characters checkpatch warring
  tpm/tpm_crb: drop useless cpu_to_le32 when writing to registers
  tpm/tpm_crb: cache cmd_size register value.
  tmp/tpm_crb: drop include to platform_device
  tpm/tpm_tis: remove unused itpm variable
  tpm_crb: fix incorrect values of cmdReady and goIdle bits
  tpm_crb: refine the naming of constants
  tpm_crb: remove wmb()'s
  tpm_crb: fix crb_req_canceled behavior
  ...
2016-10-04 14:48:27 -07:00
Miklos Szeredi
2773bf00ae fs: rename "rename2" i_op to "rename"
Generated patch:

sed -i "s/\.rename2\t/\.rename\t\t/" `git grep -wl rename2`
sed -i "s/\brename2\b/rename/g" `git grep -wl rename2`

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-09-27 11:03:58 +02:00
Richard Weinberger
6a45b3628c ovl: Fix info leak in ovl_lookup_temp()
The function uses the memory address of a struct dentry as unique id.
While the address-based directory entry is only visible to root it is IMHO
still worth fixing since the temporary name does not have to be a kernel
address.  It can be any unique number.  Replace it by an atomic integer
which is allowed to wrap around.

Signed-off-by: Richard Weinberger <richard@nod.at>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org> # v3.18+
Fixes: e9be9d5e76 ("overlay filesystem")
2016-09-21 16:37:07 +02:00
Andreas Gruenbacher
0eb45fc3bb ovl: Switch to generic_getxattr
Now that overlayfs has xattr handlers for iop->{set,remove}xattr, use
those same handlers for iop->getxattr as well.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-09-01 11:12:00 +02:00
Andreas Gruenbacher
0e585ccc13 ovl: Switch to generic_removexattr
Commit d837a49bd5 ("ovl: fix POSIX ACL setting") switches from
iop->setxattr from ovl_setxattr to generic_setxattr, so switch from
ovl_removexattr to generic_removexattr as well.  As far as permission
checking goes, the same rules should apply in either case.

While doing that, rename ovl_setxattr to ovl_xattr_set to indicate that
this is not an iop->setxattr implementation and remove the unused inode
argument.

Move ovl_other_xattr_set above ovl_own_xattr_set so that they match the
order of handlers in ovl_xattr_handlers.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Fixes: d837a49bd5 ("ovl: fix POSIX ACL setting")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-09-01 11:12:00 +02:00
Miklos Szeredi
38b256973e ovl: handle umask and posix_acl_default correctly on creation
Setting MS_POSIXACL in sb->s_flags has the side effect of passing mode to
create functions without masking against umask.

Another problem when creating over a whiteout is that the default posix acl
is not inherited from the parent dir (because the real parent dir at the
time of creation is the work directory).

Fix these problems by:

 a) If upper fs does not have MS_POSIXACL, then mask mode with umask.

 b) If creating over a whiteout, call posix_acl_create() to get the
 inherited acls.  After creation (but before moving to the final
 destination) set these acls on the created file.  posix_acl_create() also
 updates the file creation mode as appropriate.

Fixes: 39a25b2b37 ("ovl: define ->get_acl() for overlay inodes")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-09-01 11:11:59 +02:00
Vivek Goyal
2602625b7e security, overlayfs: Provide hook to correctly label newly created files
During a new file creation we need to make sure new file is created with the
right label. New file is created in upper/ so effectively file should get
label as if task had created file in upper/.

We switched to mounter's creds for actual file creation. Also if there is a
whiteout present, then file will be created in work/ dir first and then
renamed in upper. In none of the cases file will be labeled as we want it to
be.

This patch introduces a new hook dentry_create_files_as(), which determines
the label/context dentry will get if it had been created by task in upper
and modify passed set of creds appropriately. Caller makes use of these new
creds for file creation.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
[PM: fix whitespace issues found with checkpatch.pl]
[PM: changes to use stat->mode in ovl_create_or_link()]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2016-08-08 20:46:46 -04:00
Miklos Szeredi
30c17ebfb2 ovl: simplify empty checking
The empty checking logic is duplicated in ovl_check_empty_and_clear() and
ovl_remove_and_whiteout(), except the condition for clearing whiteouts is
different:

ovl_check_empty_and_clear() checked for being upper

ovl_remove_and_whiteout() checked for merge OR lower

Move the intersection of those checks (upper AND merge) into
ovl_check_empty_and_clear() and simplify ovl_remove_and_whiteout().

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-07-29 12:05:25 +02:00
Miklos Szeredi
dbc816d05d ovl: clear nlink on rmdir
To make delete notification work on fa/inotify.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-07-29 12:05:24 +02:00
Miklos Szeredi
d837a49bd5 ovl: fix POSIX ACL setting
Setting POSIX ACL needs special handling:

1) Some permission checks are done by ->setxattr() which now uses mounter's
creds ("ovl: do operations on underlying file system in mounter's
context").  These permission checks need to be done with current cred as
well.

2) Setting ACL can fail for various reasons.  We do not need to copy up in
these cases.

In the mean time switch to using generic_setxattr.

[Arnd Bergmann] Fix link error without POSIX ACL. posix_acl_from_xattr()
doesn't have a 'static inline' implementation when CONFIG_FS_POSIX_ACL is
disabled, and I could not come up with an obvious way to do it.

This instead avoids the link error by defining two sets of ACL operations
and letting the compiler drop one of the two at compile time depending
on CONFIG_FS_POSIX_ACL. This avoids all references to the ACL code,
also leading to smaller code.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-07-29 12:05:24 +02:00
Miklos Szeredi
51f7e52dc9 ovl: share inode for hard link
Inode attributes are copied up to overlay inode (uid, gid, mode, atime,
mtime, ctime) so generic code using these fields works correcty.  If a hard
link is created in overlayfs separate inodes are allocated for each link.
If chmod/chown/etc. is performed on one of the links then the inode
belonging to the other ones won't be updated.

This patch attempts to fix this by sharing inodes for hard links.

Use inode hash (with real inode pointer as a key) to make sure overlay
inodes are shared for hard links on upper.  Hard links on lower are still
split (which is not user observable until the copy-up happens, see
Documentation/filesystems/overlayfs.txt under "Non-standard behavior").

The inode is only inserted in the hash if it is non-directoy and upper.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-07-29 12:05:24 +02:00
Miklos Szeredi
39b681f802 ovl: store real inode pointer in ->i_private
To get from overlay inode to real inode we currently use 'struct
ovl_entry', which has lifetime connected to overlay dentry.  This is okay,
since each overlay dentry had a new overlay inode allocated.

Following patch will break that assumption, so need to leave out ovl_entry.
This patch stores the real inode directly in i_private, with the lowest bit
used to indicate whether the inode is upper or lower.

Lifetime rules remain, using ovl_inode_real() must only be done while
caller holds ref on overlay dentry (and hence on real dentry), or within
RCU protected regions.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-07-29 12:05:24 +02:00
Miklos Szeredi
d719e8f268 ovl: update atime on upper
Fix atime update logic in overlayfs.

This patch adds an i_op->update_time() handler to overlayfs inodes.  This
forwards atime updates to the upper layer only.  No atime updates are done
on lower layers.

Remove implicit atime updates to underlying files and directories with
O_NOATIME.  Remove explicit atime update in ovl_readlink().

Clear atime related mnt flags from cloned upper mount.  This means atime
updates are controlled purely by overlayfs mount options.

Reported-by: Konstantin Khlebnikov <koct9i@gmail.com> 
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-07-29 12:05:23 +02:00
Miklos Szeredi
bb0d2b8ad2 ovl: fix sgid on directory
When creating directory in workdir, the group/sgid inheritance from the
parent dir was omitted completely.  Fix this by calling inode_init_owner()
on overlay inode and using the resulting uid/gid/mode to create the file.

Unfortunately the sgid bit can be stripped off due to umask, so need to
reset the mode in this case in workdir before moving the directory in
place.

Reported-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-07-29 12:05:23 +02:00
Vivek Goyal
1175b6b8d9 ovl: do operations on underlying file system in mounter's context
Given we are now doing checks both on overlay inode as well underlying
inode, we should be able to do checks and operations on underlying file
system using mounter's context.

So modify all operations to do checks/operations on underlying dentry/inode
in the context of mounter.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-07-29 12:05:23 +02:00
Vivek Goyal
39a25b2b37 ovl: define ->get_acl() for overlay inodes
Now we are planning to do DAC permission checks on overlay inode
itself. And to make it work, we will need to make sure we can get acls from
underlying inode. So define ->get_acl() for overlay inodes and this in turn
calls into underlying filesystem to get acls, if any.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-07-29 12:05:23 +02:00
Vivek Goyal
72e4848181 ovl: move some common code in a function
ovl_create_upper() and ovl_create_over_whiteout() seem to be sharing some
common code which can be moved into a separate function.  No functionality
change.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-07-29 12:05:23 +02:00
Maxim Patlasov
cfc9fde0b0 ovl: verify upper dentry in ovl_remove_and_whiteout()
The upper dentry may become stale before we call ovl_lock_rename_workdir.
For example, someone could (mistakenly or maliciously) manually unlink(2)
it directly from upperdir.

To ensure it is not stale, let's lookup it after ovl_lock_rename_workdir
and and check if it matches the upper dentry.

Essentially, it is the same problem and similar solution as in
commit 11f3710417 ("ovl: verify upper dentry before unlink and rename").

Signed-off-by: Maxim Patlasov <mpatlasov@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org>
2016-07-22 10:54:20 +02:00
Miklos Szeredi
d0e13f5bbe ovl: fix uid/gid when creating over whiteout
Fix a regression when creating a file over a whiteout.  The new
file/directory needs to use the current fsuid/fsgid, not the ones from the
mounter's credentials.

The refcounting is a bit tricky: prepare_creds() sets an original refcount,
override_creds() gets one more, which revert_cred() drops.  So

  1) we need to expicitly put the mounter's credentials when overriding
     with the updated one

  2) we need to put the original ref to the updated creds (and this can
     safely be done before revert_creds(), since we'll still have the ref
     from override_creds()).

Reported-by: Stephen Smalley <sds@tycho.nsa.gov>
Fixes: 3fe6e52f06 ("ovl: override creds with the ones from the superblock mounter")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-06-15 14:18:59 +02:00
Antonio Murdaca
3fe6e52f06 ovl: override creds with the ones from the superblock mounter
In user namespace the whiteout creation fails with -EPERM because the
current process isn't capable(CAP_SYS_ADMIN) when setting xattr.

A simple reproducer:

$ mkdir upper lower work merged lower/dir
$ sudo mount -t overlay overlay -olowerdir=lower,upperdir=upper,workdir=work merged
$ unshare -m -p -f -U -r bash

Now as root in the user namespace:

\# touch merged/dir/{1,2,3} # this will force a copy up of lower/dir
\# rm -fR merged/*

This ends up failing with -EPERM after the files in dir has been
correctly deleted:

unlinkat(4, "2", 0)                     = 0
unlinkat(4, "1", 0)                     = 0
unlinkat(4, "3", 0)                     = 0
close(4)                                = 0
unlinkat(AT_FDCWD, "merged/dir", AT_REMOVEDIR) = -1 EPERM (Operation not
permitted)

Interestingly, if you don't place files in merged/dir you can remove it,
meaning if upper/dir does not exist, creating the char device file works
properly in that same location.

This patch uses ovl_sb_creator_cred() to get the cred struct from the
superblock mounter and override the old cred with these new ones so that
the whiteout creation is possible because overlay is wrong in assuming that
the creds it will get with prepare_creds will be in the initial user
namespace.  The old cap_raise game is removed in favor of just overriding
the old cred struct.

This patch also drops from ovl_copy_up_one() the following two lines:

override_cred->fsuid = stat->uid;
override_cred->fsgid = stat->gid;

This is because the correct uid and gid are taken directly with the stat
struct and correctly set with ovl_set_attr().

Signed-off-by: Antonio Murdaca <runcom@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-05-27 08:55:26 +02:00
Miklos Szeredi
6986c012fa ovl: cleanup unused var in rename2
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-03-21 17:31:46 +01:00
Miklos Szeredi
11f3710417 ovl: verify upper dentry before unlink and rename
Unlink and rename in overlayfs checked the upper dentry for staleness by
verifying upper->d_parent against upperdir.  However the dentry can go
stale also by being unhashed, for example.

Expand the verification to actually look up the name again (under parent
lock) and check if it matches the upper dentry.  This matches what the VFS
does before passing the dentry to filesytem's unlink/rename methods, which
excludes any inconsistency caused by overlayfs.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-03-21 17:31:44 +01:00
Konstantin Khlebnikov
45d1173896 ovl: ignore lower entries when checking purity of non-directory entries
After rename file dentry still holds reference to lower dentry from
previous location. This doesn't matter for data access because data comes
from upper dentry. But this stale lower dentry taints dentry at new
location and turns it into non-pure upper. Such file leaves visible
whiteout entry after remove in directory which shouldn't have whiteouts at
all.

Overlayfs already tracks pureness of file location in oe->opaque.  This
patch just uses that for detecting actual path type.

Comment from Vivek Goyal's patch:

Here are the details of the problem. Do following.

$ mkdir upper lower work merged upper/dir/
$ touch lower/test
$ sudo mount -t overlay overlay -olowerdir=lower,upperdir=upper,workdir=
work merged
$ mv merged/test merged/dir/
$ rm merged/dir/test
$ ls -l merged/dir/
/usr/bin/ls: cannot access merged/dir/test: No such file or directory
total 0
c????????? ? ? ? ?            ? test

Basic problem seems to be that once a file has been unlinked, a whiteout
has been left behind which was not needed and hence it becomes visible.

Whiteout is visible because parent dir is of not type MERGE, hence
od->is_real is set during ovl_dir_open(). And that means ovl_iterate()
passes on iterate handling directly to underlying fs. Underlying fs does
not know/filter whiteouts so it becomes visible to user.

Why did we leave a whiteout to begin with when we should not have.
ovl_do_remove() checks for OVL_TYPE_PURE_UPPER() and does not leave
whiteout if file is pure upper. In this case file is not found to be pure
upper hence whiteout is left.

So why file was not PURE_UPPER in this case? I think because dentry is
still carrying some leftover state which was valid before rename. For
example, od->numlower was set to 1 as it was a lower file. After rename,
this state is not valid anymore as there is no such file in lower.

Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Reported-by: Viktor Stanchev <me@viktorstanchev.com>
Suggested-by: Vivek Goyal <vgoyal@redhat.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=109611
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: <stable@vger.kernel.org>
2016-03-03 17:17:45 +01:00
Rui Wang
ce9113bbcb ovl: fix getcwd() failure after unsuccessful rmdir
ovl_remove_upper() should do d_drop() only after it successfully
removes the dir, otherwise a subsequent getcwd() system call will
fail, breaking userspace programs.

This is to fix: https://bugzilla.kernel.org/show_bug.cgi?id=110491

Signed-off-by: Rui Wang <rui.y.wang@intel.com>
Reviewed-by: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: <stable@vger.kernel.org>
2016-03-03 17:17:45 +01:00
Al Viro
5955102c99 wrappers for ->i_mutex access
parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
inode_foo(inode) being mutex_foo(&inode->i_mutex).

Please, use those for access to ->i_mutex; over the coming cycle
->i_mutex will become rwsem, with ->lookup() done with it held
only shared.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-01-22 18:04:28 -05:00
Miklos Szeredi
cc6f67bcaf ovl: mount read-only if workdir can't be created
OpenWRT folks reported that overlayfs fails to mount if upper fs is full,
because workdir can't be created.  Wordir creation can fail for various
other reasons too.

There's no reason that the mount itself should fail, overlayfs can work
fine without a workdir, as long as the overlay isn't modified.

So mount it read-only and don't allow remounting read-write.

Add a couple of WARN_ON()s for the impossible case of workdir being used
despite being read-only.

Reported-by: Bastian Bittorf <bittorf@bluebottle.com> 
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: <stable@vger.kernel.org> # v3.18+
2015-05-19 14:30:12 +02:00
Miklos Szeredi
d377c5eb54 ovl: don't remove non-empty opaque directory
When removing an opaque directory we can't just call rmdir() to check for
emptiness, because the directory will need to be replaced with a whiteout.
The replacement is done with RENAME_EXCHANGE, which doesn't check
emptiness.

Solution is just to check emptiness by reading the directory.  In the
future we could add a new rename flag to check for emptiness even for
RENAME_EXCHANGE to optimize this case.

Reported-by: Vincent Batts <vbatts@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Tested-by: Jordi Pujol Palomer <jordipujolp@gmail.com>
Fixes: 263b4a0fee ("ovl: dont replace opaque dir")
Cc: <stable@vger.kernel.org> # v4.0+
2015-05-14 10:04:44 +02:00
David Howells
e36cb0b89c VFS: (Scripted) Convert S_ISLNK/DIR/REG(dentry->d_inode) to d_is_*(dentry)
Convert the following where appropriate:

 (1) S_ISLNK(dentry->d_inode) to d_is_symlink(dentry).

 (2) S_ISREG(dentry->d_inode) to d_is_reg(dentry).

 (3) S_ISDIR(dentry->d_inode) to d_is_dir(dentry).  This is actually more
     complicated than it appears as some calls should be converted to
     d_can_lookup() instead.  The difference is whether the directory in
     question is a real dir with a ->lookup op or whether it's a fake dir with
     a ->d_automount op.

In some circumstances, we can subsume checks for dentry->d_inode not being
NULL into this, provided we the code isn't in a filesystem that expects
d_inode to be NULL if the dirent really *is* negative (ie. if we're going to
use d_inode() rather than d_backing_inode() to get the inode pointer).

Note that the dentry type field may be set to something other than
DCACHE_MISS_TYPE when d_inode is NULL in the case of unionmount, where the VFS
manages the fall-through from a negative dentry to a lower layer.  In such a
case, the dentry type of the negative union dentry is set to the same as the
type of the lower dentry.

However, if you know d_inode is not NULL at the call site, then you can use
the d_is_xxx() functions even in a filesystem.

There is one further complication: a 0,0 chardev dentry may be labelled
DCACHE_WHITEOUT_TYPE rather than DCACHE_SPECIAL_TYPE.  Strictly, this was
intended for special directory entry types that don't have attached inodes.

The following perl+coccinelle script was used:

use strict;

my @callers;
open($fd, 'git grep -l \'S_IS[A-Z].*->d_inode\' |') ||
    die "Can't grep for S_ISDIR and co. callers";
@callers = <$fd>;
close($fd);
unless (@callers) {
    print "No matches\n";
    exit(0);
}

my @cocci = (
    '@@',
    'expression E;',
    '@@',
    '',
    '- S_ISLNK(E->d_inode->i_mode)',
    '+ d_is_symlink(E)',
    '',
    '@@',
    'expression E;',
    '@@',
    '',
    '- S_ISDIR(E->d_inode->i_mode)',
    '+ d_is_dir(E)',
    '',
    '@@',
    'expression E;',
    '@@',
    '',
    '- S_ISREG(E->d_inode->i_mode)',
    '+ d_is_reg(E)' );

my $coccifile = "tmp.sp.cocci";
open($fd, ">$coccifile") || die $coccifile;
print($fd "$_\n") || die $coccifile foreach (@cocci);
close($fd);

foreach my $file (@callers) {
    chomp $file;
    print "Processing ", $file, "\n";
    system("spatch", "--sp-file", $coccifile, $file, "--in-place", "--no-show-diff") == 0 ||
	die "spatch failed";
}

[AV: overlayfs parts skipped]

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-22 11:38:41 -05:00
hujianyang
cead89bb08 ovl: Use macros to present ovl_xattr
This patch adds two macros:

OVL_XATTR_PRE_NAME and OVL_XATTR_PRE_LEN

to present ovl_xattr name prefix and its length. Also, a
new macro OVL_XATTR_OPAQUE is introduced to replace old
*ovl_opaque_xattr*.

Fix the length of "trusted.overlay." to *16*.

Signed-off-by: hujianyang <hujianyang@huawei.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-12-13 00:59:52 +01:00
Miklos Szeredi
263b4a0fee ovl: dont replace opaque dir
When removing an empty opaque directory, then it makes no sense to replace
it with an exact replica of itself before removal.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-12-13 00:59:43 +01:00
Miklos Szeredi
1afaba1ecb ovl: make path-type a bitmap
OVL_PATH_PURE_UPPER -> __OVL_PATH_UPPER | __OVL_PATH_PURE
OVL_PATH_UPPER      -> __OVL_PATH_UPPER
OVL_PATH_MERGE      -> __OVL_PATH_UPPER | __OVL_PATH_MERGE
OVL_PATH_LOWER      -> 0

Multiple R/O layers will allow __OVL_PATH_MERGE without __OVL_PATH_UPPER.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-12-13 00:59:42 +01:00
Miklos Szeredi
a105d685a8 ovl: fix remove/copy-up race
ovl_remove_and_whiteout() needs to check if upper dentry exists or not
after having locked upper parent directory.

Previously we used a "type" value computed before locking the upper parent
directory, which is susceptible to racing with copy-up.

There's a similar check in ovl_check_empty_and_clear().  This one is not
actually racy, since copy-up doesn't change the "emptyness" property of a
directory.  Add a comment to this effect, and check the existence of upper
dentry locally to make the code cleaner.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-11-20 16:39:59 +01:00
Miklos Szeredi
e9be9d5e76 overlay filesystem
Overlayfs allows one, usually read-write, directory tree to be
overlaid onto another, read-only directory tree.  All modifications
go to the upper, writable layer.

This type of mechanism is most often used for live CDs but there's a
wide variety of other uses.

The implementation differs from other "union filesystem"
implementations in that after a file is opened all operations go
directly to the underlying, lower or upper, filesystems.  This
simplifies the implementation and allows native performance in these
cases.

The dentry tree is duplicated from the underlying filesystems, this
enables fast cached lookups without adding special support into the
VFS.  This uses slightly more memory than union mounts, but dentries
are relatively small.

Currently inodes are duplicated as well, but it is a possible
optimization to share inodes for non-directories.

Opening non directories results in the open forwarded to the
underlying filesystem.  This makes the behavior very similar to union
mounts (with the same limitations vs. fchmod/fchown on O_RDONLY file
descriptors).

Usage:

  mount -t overlayfs overlayfs -olowerdir=/lower,upperdir=/upper/upper,workdir=/upper/work /overlay

The following cotributions have been folded into this patch:

Neil Brown <neilb@suse.de>:
 - minimal remount support
 - use correct seek function for directories
 - initialise is_real before use
 - rename ovl_fill_cache to ovl_dir_read

Felix Fietkau <nbd@openwrt.org>:
 - fix a deadlock in ovl_dir_read_merged
 - fix a deadlock in ovl_remove_whiteouts

Erez Zadok <ezk@fsl.cs.sunysb.edu>
 - fix cleanup after WARN_ON

Sedat Dilek <sedat.dilek@googlemail.com>
 - fix up permission to confirm to new API

Robin Dong <hao.bigrat@gmail.com>
 - fix possible leak in ovl_new_inode
 - create new inode in ovl_link

Andy Whitcroft <apw@canonical.com>
 - switch to __inode_permission()
 - copy up i_uid/i_gid from the underlying inode

AV:
 - ovl_copy_up_locked() - dput(ERR_PTR(...)) on two failure exits
 - ovl_clear_empty() - one failure exit forgetting to do unlock_rename(),
   lack of check for udir being the parent of upper, dropping and regaining
   the lock on udir (which would require _another_ check for parent being
   right).
 - bogus d_drop() in copyup and rename [fix from your mail]
 - copyup/remove and copyup/rename races [fix from your mail]
 - ovl_dir_fsync() leaving ERR_PTR() in ->realfile
 - ovl_entry_free() is pointless - it's just a kfree_rcu()
 - fold ovl_do_lookup() into ovl_lookup()
 - manually assigning ->d_op is wrong.  Just use ->s_d_op.
 [patches picked from Miklos]:
 * copyup/remove and copyup/rename races
 * bogus d_drop() in copyup and rename

Also thanks to the following people for testing and reporting bugs:

  Jordi Pujol <jordipujolp@gmail.com>
  Andy Whitcroft <apw@canonical.com>
  Michal Suchanek <hramrach@centrum.cz>
  Felix Fietkau <nbd@openwrt.org>
  Erez Zadok <ezk@fsl.cs.sunysb.edu>
  Randy Dunlap <rdunlap@xenotime.net>

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-10-24 00:14:38 +02:00