linux-stable/fs/pnode.c
Christian Brauner 6ac3928156
fs: allow to mount beneath top mount
Various distributions are adding or are in the process of adding support
for system extensions and in the future configuration extensions through
various tools. A more detailed explanation on system and configuration
extensions can be found on the manpage which is listed below at [1].

System extension images may – dynamically at runtime — extend the /usr/
and /opt/ directory hierarchies with additional files. This is
particularly useful on immutable system images where a /usr/ and/or
/opt/ hierarchy residing on a read-only file system shall be extended
temporarily at runtime without making any persistent modifications.

When one or more system extension images are activated, their /usr/ and
/opt/ hierarchies are combined via overlayfs with the same hierarchies
of the host OS, and the host /usr/ and /opt/ overmounted with it
("merging"). When they are deactivated, the mount point is disassembled
— again revealing the unmodified original host version of the hierarchy
("unmerging"). Merging thus makes the extension's resources suddenly
appear below the /usr/ and /opt/ hierarchies as if they were included in
the base OS image itself. Unmerging makes them disappear again, leaving
in place only the files that were shipped with the base OS image itself.

System configuration images are similar but operate on directories
containing system or service configuration.

On nearly all modern distributions mount propagation plays a crucial
role and the rootfs of the OS is a shared mount in a peer group (usually
with peer group id 1):

       TARGET  SOURCE  FSTYPE  PROPAGATION  MNT_ID  PARENT_ID
       /       /       ext4    shared:1     29      1

On such systems all services and containers run in a separate mount
namespace and are pivot_root()ed into their rootfs. A separate mount
namespace is almost always used as it is the minimal isolation mechanism
services have. But usually they are even much more isolated up to the
point where they almost become indistinguishable from containers.

Mount propagation again plays a crucial role here. The rootfs of all
these services is a slave mount to the peer group of the host rootfs.
This is done so the service will receive mount propagation events from
the host when certain files or directories are updated.

In addition, the rootfs of each service, container, and sandbox is also
a shared mount in its separate peer group:

       TARGET  SOURCE  FSTYPE  PROPAGATION         MNT_ID  PARENT_ID
       /       /       ext4    shared:24 master:1  71      47

For people not too familiar with mount propagation, the master:1 means
that this is a slave mount to peer group 1. Which as one can see is the
host rootfs as indicated by shared:1 above. The shared:24 indicates that
the service rootfs is a shared mount in a separate peer group with peer
group id 24.

A service may run other services. Such nested services will also have a
rootfs mount that is a slave to the peer group of the outer service
rootfs mount.

For containers things are just slighly different. A container's rootfs
isn't a slave to the service's or host rootfs' peer group. The rootfs
mount of a container is simply a shared mount in its own peer group:

       TARGET                    SOURCE  FSTYPE  PROPAGATION  MNT_ID  PARENT_ID
       /home/ubuntu/debian-tree  /       ext4    shared:99    61      60

So whereas services are isolated OS components a container is treated
like a separate world and mount propagation into it is restricted to a
single well known mount that is a slave to the peer group of the shared
mount /run on the host:

       TARGET                  SOURCE              FSTYPE  PROPAGATION  MNT_ID  PARENT_ID
       /propagate/debian-tree  /run/host/incoming  tmpfs   master:5     71      68

Here, the master:5 indicates that this mount is a slave to the peer
group with peer group id 5. This allows to propagate mounts into the
container and served as a workaround for not being able to insert mounts
into mount namespaces directly. But the new mount api does support
inserting mounts directly. For the interested reader the blogpost in [2]
might be worth reading where I explain the old and the new approach to
inserting mounts into mount namespaces.

Containers of course, can themselves be run as services. They often run
full systems themselves which means they again run services and
containers with the exact same propagation settings explained above.

The whole system is designed so that it can be easily updated, including
all services in various fine-grained ways without having to enter every
single service's mount namespace which would be prohibitively expensive.
The mount propagation layout has been carefully chosen so it is possible
to propagate updates for system extensions and configurations from the
host into all services.

The simplest model to update the whole system is to mount on top of
/usr, /opt, or /etc on the host. The new mount on /usr, /opt, or /etc
will then propagate into every service. This works cleanly the first
time. However, when the system is updated multiple times it becomes
necessary to unmount the first update on /opt, /usr, /etc and then
propagate the new update. But this means, there's an interval where the
old base system is accessible. This has to be avoided to protect against
downgrade attacks.

The vfs already exposes a mechanism to userspace whereby mounts can be
mounted beneath an existing mount. Such mounts are internally referred
to as "tucked". The patch series exposes the ability to mount beneath a
top mount through the new MOVE_MOUNT_BENEATH flag for the move_mount()
system call. This allows userspace to seamlessly upgrade mounts. After
this series the only thing that will have changed is that mounting
beneath an existing mount can be done explicitly instead of just
implicitly.

Today, there are two scenarios where a mount can be mounted beneath an
existing mount instead of on top of it:

(1) When a service or container is started in a new mount namespace and
    pivot_root()s into its new rootfs. The way this is done is by
    mounting the new rootfs beneath the old rootfs:

            fd_newroot = open("/var/lib/machines/fedora", ...);
            fd_oldroot = open("/", ...);
            fchdir(fd_newroot);
            pivot_root(".", ".");

    After the pivot_root(".", ".") call the new rootfs is mounted
    beneath the old rootfs which can then be unmounted to reveal the
    underlying mount:

            fchdir(fd_oldroot);
            umount2(".", MNT_DETACH);

    Since pivot_root() moves the caller into a new rootfs no mounts must
    be propagated out of the new rootfs as a consequence of the
    pivot_root() call. Thus, the mounts cannot be shared.

(2) When a mount is propagated to a mount that already has another mount
    mounted on the same dentry.

    The easiest example for this is to create a new mount namespace. The
    following commands will create a mount namespace where the rootfs
    mount / will be a slave to the peer group of the host rootfs /
    mount's peer group. IOW, it will receive propagation from the host:

            mount --make-shared /
            unshare --mount --propagation=slave

    Now a new mount on the /mnt dentry in that mount namespace is
    created. (As it can be confusing it should be spelled out that the
    tmpfs mount on the /mnt dentry that was just created doesn't
    propagate back to the host because the rootfs mount / of the mount
    namespace isn't a peer of the host rootfs.):

            mount -t tmpfs tmpfs /mnt

            TARGET  SOURCE  FSTYPE  PROPAGATION
            └─/mnt  tmpfs   tmpfs

    Now another terminal in the host mount namespace can observe that
    the mount indeed hasn't propagated back to into the host mount
    namespace. A new mount can now be created on top of the /mnt dentry
    with the rootfs mount / as its parent:

            mount --bind /opt /mnt

            TARGET  SOURCE           FSTYPE  PROPAGATION
            └─/mnt  /dev/sda2[/opt]  ext4    shared:1

    The mount namespace that was created earlier can now observe that
    the bind mount created on the host has propagated into it:

            TARGET    SOURCE           FSTYPE  PROPAGATION
            └─/mnt    /dev/sda2[/opt]  ext4    master:1
              └─/mnt  tmpfs            tmpfs

    But instead of having been mounted on top of the tmpfs mount at the
    /mnt dentry the /opt mount has been mounted on top of the rootfs
    mount at the /mnt dentry. And the tmpfs mount has been remounted on
    top of the propagated /opt mount at the /opt dentry. So in other
    words, the propagated mount has been mounted beneath the preexisting
    mount in that mount namespace.

    Mount namespaces make this easy to illustrate but it's also easy to
    mount beneath an existing mount in the same mount namespace
    (The following example assumes a shared rootfs mount / with peer
     group id 1):

            mount --bind /opt /opt

            TARGET   SOURCE          FSTYPE  MNT_ID  PARENT_ID  PROPAGATION
            └─/opt  /dev/sda2[/opt]  ext4    188     29         shared:1

    If another mount is mounted on top of the /opt mount at the /opt
    dentry:

            mount --bind /tmp /opt

    The following clunky mount tree will result:

            TARGET      SOURCE           FSTYPE  MNT_ID  PARENT_ID  PROPAGATION
            └─/opt      /dev/sda2[/tmp]  ext4    405      29        shared:1
              └─/opt    /dev/sda2[/opt]  ext4    188     405        shared:1
                └─/opt  /dev/sda2[/tmp]  ext4    404     188        shared:1

    The /tmp mount is mounted beneath the /opt mount and another copy is
    mounted on top of the /opt mount. This happens because the rootfs /
    and the /opt mount are shared mounts in the same peer group.

    When the new /tmp mount is supposed to be mounted at the /opt dentry
    then the /tmp mount first propagates to the root mount at the /opt
    dentry. But there already is the /opt mount mounted at the /opt
    dentry. So the old /opt mount at the /opt dentry will be mounted on
    top of the new /tmp mount at the /tmp dentry, i.e. @opt->mnt_parent
    is @tmp and @opt->mnt_mountpoint is /tmp (Note that @opt->mnt_root
    is /opt which is what shows up as /opt under SOURCE). So again, a
    mount will be mounted beneath a preexisting mount.

    (Fwiw, a few iterations of mount --bind /opt /opt in a loop on a
     shared rootfs is a good example of what could be referred to as
     mount explosion.)

The main point is that such mounts allows userspace to umount a top
mount and reveal an underlying mount. So for example, umounting the
tmpfs mount on /mnt that was created in example (1) using mount
namespaces reveals the /opt mount which was mounted beneath it.

In (2) where a mount was mounted beneath the top mount in the same mount
namespace unmounting the top mount would unmount both the top mount and
the mount beneath. In the process the original mount would be remounted
on top of the rootfs mount / at the /opt dentry again.

This again, is a result of mount propagation only this time it's umount
propagation. However, this can be avoided by simply making the parent
mount / of the @opt mount a private or slave mount. Then the top mount
and the original mount can be unmounted to reveal the mount beneath.

These two examples are fairly arcane and are merely added to make it
clear how mount propagation has effects on current and future features.

More common use-cases will just be things like:

        mount -t btrfs /dev/sdA /mnt
        mount -t xfs   /dev/sdB --beneath /mnt
        umount /mnt

after which we'll have updated from a btrfs filesystem to a xfs
filesystem without ever revealing the underlying mountpoint.

The crux is that the proposed mechanism already exists and that it is so
powerful as to cover cases where mounts are supposed to be updated with
new versions. Crucially, it offers an important flexibility. Namely that
updates to a system may either be forced or can be delayed and the
umount of the top mount be left to a service if it is a cooperative one.

This adds a new flag to move_mount() that allows to explicitly move a
beneath the top mount adhering to the following semantics:

* Mounts cannot be mounted beneath the rootfs. This restriction
  encompasses the rootfs but also chroots via chroot() and pivot_root().
  To mount a mount beneath the rootfs or a chroot, pivot_root() can be
  used as illustrated above.
* The source mount must be a private mount to force the kernel to
  allocate a new, unused peer group id. This isn't a required
  restriction but a voluntary one. It avoids repeating a semantical
  quirk that already exists today. If bind mounts which already have a
  peer group id are inserted into mount trees that have the same peer
  group id this can cause a lot of mount propagation events to be
  generated (For example, consider running mount --bind /opt /opt in a
  loop where the parent mount is a shared mount.).
* Avoid getting rid of the top mount in the kernel. Cooperative services
  need to be able to unmount the top mount themselves.
  This also avoids a good deal of additional complexity. The umount
  would have to be propagated which would be another rather expensive
  operation. So namespace_lock() and lock_mount_hash() would potentially
  have to be held for a long time for both a mount and umount
  propagation. That should be avoided.
* The path to mount beneath must be mounted and attached.
* The top mount and its parent must be in the caller's mount namespace
  and the caller must be able to mount in that mount namespace.
* The caller must be able to unmount the top mount to prove that they
  could reveal the underlying mount.
* The propagation tree is calculated based on the destination mount's
  parent mount and the destination mount's mountpoint on the parent
  mount. Of course, if the parent of the destination mount and the
  destination mount are shared mounts in the same peer group and the
  mountpoint of the new mount to be mounted is a subdir of their
  ->mnt_root then both will receive a mount of /opt. That's probably
  easier to understand with an example. Assuming a standard shared
  rootfs /:

          mount --bind /opt /opt
          mount --bind /tmp /opt

  will cause the same mount tree as:

          mount --bind /opt /opt
          mount --beneath /tmp /opt

  because both / and /opt are shared mounts/peers in the same peer
  group and the /opt dentry is a subdirectory of both the parent's and
  the child's ->mnt_root. If a mount tree like that is created it almost
  always is an accident or abuse of mount propagation. Realistically
  what most people probably mean in this scenarios is:

          mount --bind /opt /opt
          mount --make-private /opt
          mount --make-shared /opt

  This forces the allocation of a new separate peer group for the /opt
  mount. Aferwards a mount --bind or mount --beneath actually makes
  sense as the / and /opt mount belong to different peer groups. Before
  that it's likely just confusion about what the user wanted to achieve.
* Refuse MOVE_MOUNT_BENEATH if:
  (1) the @mnt_from has been overmounted in between path resolution and
      acquiring @namespace_sem when locking @mnt_to. This avoids the
      proliferation of shadow mounts.
  (2) if @to_mnt is moved to a different mountpoint while acquiring
      @namespace_sem to lock @to_mnt.
  (3) if @to_mnt is unmounted while acquiring @namespace_sem to lock
      @to_mnt.
  (4) if the parent of the target mount propagates to the target mount
      at the same mountpoint.
      This would mean mounting @mnt_from on @mnt_to->mnt_parent and then
      propagating a copy @c of @mnt_from onto @mnt_to. This defeats the
      whole purpose of mounting @mnt_from beneath @mnt_to.
  (5) if the parent mount @mnt_to->mnt_parent propagates to @mnt_from at
      the same mountpoint.
      If @mnt_to->mnt_parent propagates to @mnt_from this would mean
      propagating a copy @c of @mnt_from on top of @mnt_from. Afterwards
      @mnt_from would be mounted on top of @mnt_to->mnt_parent and
      @mnt_to would be unmounted from @mnt->mnt_parent and remounted on
      @mnt_from. But since @c is already mounted on @mnt_from, @mnt_to
      would ultimately be remounted on top of @c. Afterwards, @mnt_from
      would be covered by a copy @c of @mnt_from and @c would be covered
      by @mnt_from itself. This defeats the whole purpose of mounting
      @mnt_from beneath @mnt_to.
  Cases (1) to (3) are required as they deal with races that would cause
  bugs or unexpected behavior for users. Cases (4) and (5) refuse
  semantical quirks that would not be a bug but would cause weird mount
  trees to be created. While they can already be created via other means
  (mount --bind /opt /opt x n) there's no reason to repeat past mistakes
  in new features.

Link: https://man7.org/linux/man-pages/man8/systemd-sysext.8.html [1]
Link: https://brauner.io/2023/02/28/mounting-into-mount-namespaces.html [2]
Link: https://github.com/flatcar/sysext-bakery
Link: https://fedoraproject.org/wiki/Changes/Unified_Kernel_Support_Phase_1
Link: https://fedoraproject.org/wiki/Changes/Unified_Kernel_Support_Phase_2
Link: https://github.com/systemd/systemd/pull/26013

Reviewed-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org>
Message-Id: <20230202-fs-move-mount-replace-v4-4-98f3d80d7eaa@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-05-19 04:30:22 +02:00

640 lines
16 KiB
C

// SPDX-License-Identifier: GPL-2.0-only
/*
* linux/fs/pnode.c
*
* (C) Copyright IBM Corporation 2005.
* Author : Ram Pai (linuxram@us.ibm.com)
*/
#include <linux/mnt_namespace.h>
#include <linux/mount.h>
#include <linux/fs.h>
#include <linux/nsproxy.h>
#include <uapi/linux/mount.h>
#include "internal.h"
#include "pnode.h"
/* return the next shared peer mount of @p */
static inline struct mount *next_peer(struct mount *p)
{
return list_entry(p->mnt_share.next, struct mount, mnt_share);
}
static inline struct mount *first_slave(struct mount *p)
{
return list_entry(p->mnt_slave_list.next, struct mount, mnt_slave);
}
static inline struct mount *last_slave(struct mount *p)
{
return list_entry(p->mnt_slave_list.prev, struct mount, mnt_slave);
}
static inline struct mount *next_slave(struct mount *p)
{
return list_entry(p->mnt_slave.next, struct mount, mnt_slave);
}
static struct mount *get_peer_under_root(struct mount *mnt,
struct mnt_namespace *ns,
const struct path *root)
{
struct mount *m = mnt;
do {
/* Check the namespace first for optimization */
if (m->mnt_ns == ns && is_path_reachable(m, m->mnt.mnt_root, root))
return m;
m = next_peer(m);
} while (m != mnt);
return NULL;
}
/*
* Get ID of closest dominating peer group having a representative
* under the given root.
*
* Caller must hold namespace_sem
*/
int get_dominating_id(struct mount *mnt, const struct path *root)
{
struct mount *m;
for (m = mnt->mnt_master; m != NULL; m = m->mnt_master) {
struct mount *d = get_peer_under_root(m, mnt->mnt_ns, root);
if (d)
return d->mnt_group_id;
}
return 0;
}
static int do_make_slave(struct mount *mnt)
{
struct mount *master, *slave_mnt;
if (list_empty(&mnt->mnt_share)) {
if (IS_MNT_SHARED(mnt)) {
mnt_release_group_id(mnt);
CLEAR_MNT_SHARED(mnt);
}
master = mnt->mnt_master;
if (!master) {
struct list_head *p = &mnt->mnt_slave_list;
while (!list_empty(p)) {
slave_mnt = list_first_entry(p,
struct mount, mnt_slave);
list_del_init(&slave_mnt->mnt_slave);
slave_mnt->mnt_master = NULL;
}
return 0;
}
} else {
struct mount *m;
/*
* slave 'mnt' to a peer mount that has the
* same root dentry. If none is available then
* slave it to anything that is available.
*/
for (m = master = next_peer(mnt); m != mnt; m = next_peer(m)) {
if (m->mnt.mnt_root == mnt->mnt.mnt_root) {
master = m;
break;
}
}
list_del_init(&mnt->mnt_share);
mnt->mnt_group_id = 0;
CLEAR_MNT_SHARED(mnt);
}
list_for_each_entry(slave_mnt, &mnt->mnt_slave_list, mnt_slave)
slave_mnt->mnt_master = master;
list_move(&mnt->mnt_slave, &master->mnt_slave_list);
list_splice(&mnt->mnt_slave_list, master->mnt_slave_list.prev);
INIT_LIST_HEAD(&mnt->mnt_slave_list);
mnt->mnt_master = master;
return 0;
}
/*
* vfsmount lock must be held for write
*/
void change_mnt_propagation(struct mount *mnt, int type)
{
if (type == MS_SHARED) {
set_mnt_shared(mnt);
return;
}
do_make_slave(mnt);
if (type != MS_SLAVE) {
list_del_init(&mnt->mnt_slave);
mnt->mnt_master = NULL;
if (type == MS_UNBINDABLE)
mnt->mnt.mnt_flags |= MNT_UNBINDABLE;
else
mnt->mnt.mnt_flags &= ~MNT_UNBINDABLE;
}
}
/*
* get the next mount in the propagation tree.
* @m: the mount seen last
* @origin: the original mount from where the tree walk initiated
*
* Note that peer groups form contiguous segments of slave lists.
* We rely on that in get_source() to be able to find out if
* vfsmount found while iterating with propagation_next() is
* a peer of one we'd found earlier.
*/
static struct mount *propagation_next(struct mount *m,
struct mount *origin)
{
/* are there any slaves of this mount? */
if (!IS_MNT_NEW(m) && !list_empty(&m->mnt_slave_list))
return first_slave(m);
while (1) {
struct mount *master = m->mnt_master;
if (master == origin->mnt_master) {
struct mount *next = next_peer(m);
return (next == origin) ? NULL : next;
} else if (m->mnt_slave.next != &master->mnt_slave_list)
return next_slave(m);
/* back at master */
m = master;
}
}
static struct mount *skip_propagation_subtree(struct mount *m,
struct mount *origin)
{
/*
* Advance m such that propagation_next will not return
* the slaves of m.
*/
if (!IS_MNT_NEW(m) && !list_empty(&m->mnt_slave_list))
m = last_slave(m);
return m;
}
static struct mount *next_group(struct mount *m, struct mount *origin)
{
while (1) {
while (1) {
struct mount *next;
if (!IS_MNT_NEW(m) && !list_empty(&m->mnt_slave_list))
return first_slave(m);
next = next_peer(m);
if (m->mnt_group_id == origin->mnt_group_id) {
if (next == origin)
return NULL;
} else if (m->mnt_slave.next != &next->mnt_slave)
break;
m = next;
}
/* m is the last peer */
while (1) {
struct mount *master = m->mnt_master;
if (m->mnt_slave.next != &master->mnt_slave_list)
return next_slave(m);
m = next_peer(master);
if (master->mnt_group_id == origin->mnt_group_id)
break;
if (master->mnt_slave.next == &m->mnt_slave)
break;
m = master;
}
if (m == origin)
return NULL;
}
}
/* all accesses are serialized by namespace_sem */
static struct mount *last_dest, *first_source, *last_source, *dest_master;
static struct hlist_head *list;
static inline bool peers(const struct mount *m1, const struct mount *m2)
{
return m1->mnt_group_id == m2->mnt_group_id && m1->mnt_group_id;
}
static int propagate_one(struct mount *m, struct mountpoint *dest_mp)
{
struct mount *child;
int type;
/* skip ones added by this propagate_mnt() */
if (IS_MNT_NEW(m))
return 0;
/* skip if mountpoint isn't covered by it */
if (!is_subdir(dest_mp->m_dentry, m->mnt.mnt_root))
return 0;
if (peers(m, last_dest)) {
type = CL_MAKE_SHARED;
} else {
struct mount *n, *p;
bool done;
for (n = m; ; n = p) {
p = n->mnt_master;
if (p == dest_master || IS_MNT_MARKED(p))
break;
}
do {
struct mount *parent = last_source->mnt_parent;
if (peers(last_source, first_source))
break;
done = parent->mnt_master == p;
if (done && peers(n, parent))
break;
last_source = last_source->mnt_master;
} while (!done);
type = CL_SLAVE;
/* beginning of peer group among the slaves? */
if (IS_MNT_SHARED(m))
type |= CL_MAKE_SHARED;
}
child = copy_tree(last_source, last_source->mnt.mnt_root, type);
if (IS_ERR(child))
return PTR_ERR(child);
read_seqlock_excl(&mount_lock);
mnt_set_mountpoint(m, dest_mp, child);
if (m->mnt_master != dest_master)
SET_MNT_MARK(m->mnt_master);
read_sequnlock_excl(&mount_lock);
last_dest = m;
last_source = child;
hlist_add_head(&child->mnt_hash, list);
return count_mounts(m->mnt_ns, child);
}
/*
* mount 'source_mnt' under the destination 'dest_mnt' at
* dentry 'dest_dentry'. And propagate that mount to
* all the peer and slave mounts of 'dest_mnt'.
* Link all the new mounts into a propagation tree headed at
* source_mnt. Also link all the new mounts using ->mnt_list
* headed at source_mnt's ->mnt_list
*
* @dest_mnt: destination mount.
* @dest_dentry: destination dentry.
* @source_mnt: source mount.
* @tree_list : list of heads of trees to be attached.
*/
int propagate_mnt(struct mount *dest_mnt, struct mountpoint *dest_mp,
struct mount *source_mnt, struct hlist_head *tree_list)
{
struct mount *m, *n;
int ret = 0;
/*
* we don't want to bother passing tons of arguments to
* propagate_one(); everything is serialized by namespace_sem,
* so globals will do just fine.
*/
last_dest = dest_mnt;
first_source = source_mnt;
last_source = source_mnt;
list = tree_list;
dest_master = dest_mnt->mnt_master;
/* all peers of dest_mnt, except dest_mnt itself */
for (n = next_peer(dest_mnt); n != dest_mnt; n = next_peer(n)) {
ret = propagate_one(n, dest_mp);
if (ret)
goto out;
}
/* all slave groups */
for (m = next_group(dest_mnt, dest_mnt); m;
m = next_group(m, dest_mnt)) {
/* everything in that slave group */
n = m;
do {
ret = propagate_one(n, dest_mp);
if (ret)
goto out;
n = next_peer(n);
} while (n != m);
}
out:
read_seqlock_excl(&mount_lock);
hlist_for_each_entry(n, tree_list, mnt_hash) {
m = n->mnt_parent;
if (m->mnt_master != dest_mnt->mnt_master)
CLEAR_MNT_MARK(m->mnt_master);
}
read_sequnlock_excl(&mount_lock);
return ret;
}
static struct mount *find_topper(struct mount *mnt)
{
/* If there is exactly one mount covering mnt completely return it. */
struct mount *child;
if (!list_is_singular(&mnt->mnt_mounts))
return NULL;
child = list_first_entry(&mnt->mnt_mounts, struct mount, mnt_child);
if (child->mnt_mountpoint != mnt->mnt.mnt_root)
return NULL;
return child;
}
/*
* return true if the refcount is greater than count
*/
static inline int do_refcount_check(struct mount *mnt, int count)
{
return mnt_get_count(mnt) > count;
}
/**
* propagation_would_overmount - check whether propagation from @from
* would overmount @to
* @from: shared mount
* @to: mount to check
* @mp: future mountpoint of @to on @from
*
* If @from propagates mounts to @to, @from and @to must either be peers
* or one of the masters in the hierarchy of masters of @to must be a
* peer of @from.
*
* If the root of the @to mount is equal to the future mountpoint @mp of
* the @to mount on @from then @to will be overmounted by whatever is
* propagated to it.
*
* Context: This function expects namespace_lock() to be held and that
* @mp is stable.
* Return: If @from overmounts @to, true is returned, false if not.
*/
bool propagation_would_overmount(const struct mount *from,
const struct mount *to,
const struct mountpoint *mp)
{
if (!IS_MNT_SHARED(from))
return false;
if (IS_MNT_NEW(to))
return false;
if (to->mnt.mnt_root != mp->m_dentry)
return false;
for (const struct mount *m = to; m; m = m->mnt_master) {
if (peers(from, m))
return true;
}
return false;
}
/*
* check if the mount 'mnt' can be unmounted successfully.
* @mnt: the mount to be checked for unmount
* NOTE: unmounting 'mnt' would naturally propagate to all
* other mounts its parent propagates to.
* Check if any of these mounts that **do not have submounts**
* have more references than 'refcnt'. If so return busy.
*
* vfsmount lock must be held for write
*/
int propagate_mount_busy(struct mount *mnt, int refcnt)
{
struct mount *m, *child, *topper;
struct mount *parent = mnt->mnt_parent;
if (mnt == parent)
return do_refcount_check(mnt, refcnt);
/*
* quickly check if the current mount can be unmounted.
* If not, we don't have to go checking for all other
* mounts
*/
if (!list_empty(&mnt->mnt_mounts) || do_refcount_check(mnt, refcnt))
return 1;
for (m = propagation_next(parent, parent); m;
m = propagation_next(m, parent)) {
int count = 1;
child = __lookup_mnt(&m->mnt, mnt->mnt_mountpoint);
if (!child)
continue;
/* Is there exactly one mount on the child that covers
* it completely whose reference should be ignored?
*/
topper = find_topper(child);
if (topper)
count += 1;
else if (!list_empty(&child->mnt_mounts))
continue;
if (do_refcount_check(child, count))
return 1;
}
return 0;
}
/*
* Clear MNT_LOCKED when it can be shown to be safe.
*
* mount_lock lock must be held for write
*/
void propagate_mount_unlock(struct mount *mnt)
{
struct mount *parent = mnt->mnt_parent;
struct mount *m, *child;
BUG_ON(parent == mnt);
for (m = propagation_next(parent, parent); m;
m = propagation_next(m, parent)) {
child = __lookup_mnt(&m->mnt, mnt->mnt_mountpoint);
if (child)
child->mnt.mnt_flags &= ~MNT_LOCKED;
}
}
static void umount_one(struct mount *mnt, struct list_head *to_umount)
{
CLEAR_MNT_MARK(mnt);
mnt->mnt.mnt_flags |= MNT_UMOUNT;
list_del_init(&mnt->mnt_child);
list_del_init(&mnt->mnt_umounting);
list_move_tail(&mnt->mnt_list, to_umount);
}
/*
* NOTE: unmounting 'mnt' naturally propagates to all other mounts its
* parent propagates to.
*/
static bool __propagate_umount(struct mount *mnt,
struct list_head *to_umount,
struct list_head *to_restore)
{
bool progress = false;
struct mount *child;
/*
* The state of the parent won't change if this mount is
* already unmounted or marked as without children.
*/
if (mnt->mnt.mnt_flags & (MNT_UMOUNT | MNT_MARKED))
goto out;
/* Verify topper is the only grandchild that has not been
* speculatively unmounted.
*/
list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) {
if (child->mnt_mountpoint == mnt->mnt.mnt_root)
continue;
if (!list_empty(&child->mnt_umounting) && IS_MNT_MARKED(child))
continue;
/* Found a mounted child */
goto children;
}
/* Mark mounts that can be unmounted if not locked */
SET_MNT_MARK(mnt);
progress = true;
/* If a mount is without children and not locked umount it. */
if (!IS_MNT_LOCKED(mnt)) {
umount_one(mnt, to_umount);
} else {
children:
list_move_tail(&mnt->mnt_umounting, to_restore);
}
out:
return progress;
}
static void umount_list(struct list_head *to_umount,
struct list_head *to_restore)
{
struct mount *mnt, *child, *tmp;
list_for_each_entry(mnt, to_umount, mnt_list) {
list_for_each_entry_safe(child, tmp, &mnt->mnt_mounts, mnt_child) {
/* topper? */
if (child->mnt_mountpoint == mnt->mnt.mnt_root)
list_move_tail(&child->mnt_umounting, to_restore);
else
umount_one(child, to_umount);
}
}
}
static void restore_mounts(struct list_head *to_restore)
{
/* Restore mounts to a clean working state */
while (!list_empty(to_restore)) {
struct mount *mnt, *parent;
struct mountpoint *mp;
mnt = list_first_entry(to_restore, struct mount, mnt_umounting);
CLEAR_MNT_MARK(mnt);
list_del_init(&mnt->mnt_umounting);
/* Should this mount be reparented? */
mp = mnt->mnt_mp;
parent = mnt->mnt_parent;
while (parent->mnt.mnt_flags & MNT_UMOUNT) {
mp = parent->mnt_mp;
parent = parent->mnt_parent;
}
if (parent != mnt->mnt_parent)
mnt_change_mountpoint(parent, mp, mnt);
}
}
static void cleanup_umount_visitations(struct list_head *visited)
{
while (!list_empty(visited)) {
struct mount *mnt =
list_first_entry(visited, struct mount, mnt_umounting);
list_del_init(&mnt->mnt_umounting);
}
}
/*
* collect all mounts that receive propagation from the mount in @list,
* and return these additional mounts in the same list.
* @list: the list of mounts to be unmounted.
*
* vfsmount lock must be held for write
*/
int propagate_umount(struct list_head *list)
{
struct mount *mnt;
LIST_HEAD(to_restore);
LIST_HEAD(to_umount);
LIST_HEAD(visited);
/* Find candidates for unmounting */
list_for_each_entry_reverse(mnt, list, mnt_list) {
struct mount *parent = mnt->mnt_parent;
struct mount *m;
/*
* If this mount has already been visited it is known that it's
* entire peer group and all of their slaves in the propagation
* tree for the mountpoint has already been visited and there is
* no need to visit them again.
*/
if (!list_empty(&mnt->mnt_umounting))
continue;
list_add_tail(&mnt->mnt_umounting, &visited);
for (m = propagation_next(parent, parent); m;
m = propagation_next(m, parent)) {
struct mount *child = __lookup_mnt(&m->mnt,
mnt->mnt_mountpoint);
if (!child)
continue;
if (!list_empty(&child->mnt_umounting)) {
/*
* If the child has already been visited it is
* know that it's entire peer group and all of
* their slaves in the propgation tree for the
* mountpoint has already been visited and there
* is no need to visit this subtree again.
*/
m = skip_propagation_subtree(m, parent);
continue;
} else if (child->mnt.mnt_flags & MNT_UMOUNT) {
/*
* We have come accross an partially unmounted
* mount in list that has not been visited yet.
* Remember it has been visited and continue
* about our merry way.
*/
list_add_tail(&child->mnt_umounting, &visited);
continue;
}
/* Check the child and parents while progress is made */
while (__propagate_umount(child,
&to_umount, &to_restore)) {
/* Is the parent a umount candidate? */
child = child->mnt_parent;
if (list_empty(&child->mnt_umounting))
break;
}
}
}
umount_list(&to_umount, &to_restore);
restore_mounts(&to_restore);
cleanup_umount_visitations(&visited);
list_splice_tail(&to_umount, list);
return 0;
}