linux-stable/include/uapi/linux/inotify.h
Kirill Tkhai e1603b6eff inotify: Extend ioctl to allow to request id of new watch descriptor
Watch descriptor is id of the watch created by inotify_add_watch().
It is allocated in inotify_add_to_idr(), and takes the numbers
starting from 1. Every new inotify watch obtains next available
number (usually, old + 1), as served by idr_alloc_cyclic().

CRIU (Checkpoint/Restore In Userspace) project supports inotify
files, and restores watched descriptors with the same numbers,
they had before dump. Since there was no kernel support, we
had to use cycle to add a watch with specific descriptor id:

	while (1) {
		int wd;

		wd = inotify_add_watch(inotify_fd, path, mask);
		if (wd < 0) {
			break;
		} else if (wd == desired_wd_id) {
			ret = 0;
			break;
		}

		inotify_rm_watch(inotify_fd, wd);
	}

(You may find the actual code at the below link:
 https://github.com/checkpoint-restore/criu/blob/v3.7/criu/fsnotify.c#L577)

The cycle is suboptiomal and very expensive, but since there is no better
kernel support, it was the only way to restore that. Happily, we had met
mostly descriptors with small id, and this approach had worked somehow.

But recent time containers with inotify with big watch descriptors
begun to come, and this way stopped to work at all. When descriptor id
is something about 0x34d71d6, the restoring process spins in busy loop
for a long time, and the restore hungs and delay of migration from node
to node could easily be watched.

This patch aims to solve this problem. It introduces new ioctl
INOTIFY_IOC_SETNEXTWD, which allows to request the number of next created
watch descriptor from userspace. It simply calls idr_set_cursor() primitive
to populate idr::idr_next, so that next idr_alloc_cyclic() allocation
will return this id, if it is not occupied. This is the way which is
used to restore some other resources from userspace. For example,
/proc/sys/kernel/ns_last_pid works the same for task pids.

The new code is under CONFIG_CHECKPOINT_RESTORE #define, so small system
may exclude it.

v2: Use INT_MAX instead of custom definition of max id,
as IDR subsystem guarantees id is between 0 and INT_MAX.

CC: Jan Kara <jack@suse.cz>
CC: Matthew Wilcox <willy@infradead.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-02-14 11:16:28 +01:00

83 lines
3.2 KiB
C

/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
/*
* Inode based directory notification for Linux
*
* Copyright (C) 2005 John McCutchan
*/
#ifndef _UAPI_LINUX_INOTIFY_H
#define _UAPI_LINUX_INOTIFY_H
/* For O_CLOEXEC and O_NONBLOCK */
#include <linux/fcntl.h>
#include <linux/types.h>
/*
* struct inotify_event - structure read from the inotify device for each event
*
* When you are watching a directory, you will receive the filename for events
* such as IN_CREATE, IN_DELETE, IN_OPEN, IN_CLOSE, ..., relative to the wd.
*/
struct inotify_event {
__s32 wd; /* watch descriptor */
__u32 mask; /* watch mask */
__u32 cookie; /* cookie to synchronize two events */
__u32 len; /* length (including nulls) of name */
char name[0]; /* stub for possible name */
};
/* the following are legal, implemented events that user-space can watch for */
#define IN_ACCESS 0x00000001 /* File was accessed */
#define IN_MODIFY 0x00000002 /* File was modified */
#define IN_ATTRIB 0x00000004 /* Metadata changed */
#define IN_CLOSE_WRITE 0x00000008 /* Writtable file was closed */
#define IN_CLOSE_NOWRITE 0x00000010 /* Unwrittable file closed */
#define IN_OPEN 0x00000020 /* File was opened */
#define IN_MOVED_FROM 0x00000040 /* File was moved from X */
#define IN_MOVED_TO 0x00000080 /* File was moved to Y */
#define IN_CREATE 0x00000100 /* Subfile was created */
#define IN_DELETE 0x00000200 /* Subfile was deleted */
#define IN_DELETE_SELF 0x00000400 /* Self was deleted */
#define IN_MOVE_SELF 0x00000800 /* Self was moved */
/* the following are legal events. they are sent as needed to any watch */
#define IN_UNMOUNT 0x00002000 /* Backing fs was unmounted */
#define IN_Q_OVERFLOW 0x00004000 /* Event queued overflowed */
#define IN_IGNORED 0x00008000 /* File was ignored */
/* helper events */
#define IN_CLOSE (IN_CLOSE_WRITE | IN_CLOSE_NOWRITE) /* close */
#define IN_MOVE (IN_MOVED_FROM | IN_MOVED_TO) /* moves */
/* special flags */
#define IN_ONLYDIR 0x01000000 /* only watch the path if it is a directory */
#define IN_DONT_FOLLOW 0x02000000 /* don't follow a sym link */
#define IN_EXCL_UNLINK 0x04000000 /* exclude events on unlinked objects */
#define IN_MASK_ADD 0x20000000 /* add to the mask of an already existing watch */
#define IN_ISDIR 0x40000000 /* event occurred against dir */
#define IN_ONESHOT 0x80000000 /* only send event once */
/*
* All of the events - we build the list by hand so that we can add flags in
* the future and not break backward compatibility. Apps will get only the
* events that they originally wanted. Be sure to add new events here!
*/
#define IN_ALL_EVENTS (IN_ACCESS | IN_MODIFY | IN_ATTRIB | IN_CLOSE_WRITE | \
IN_CLOSE_NOWRITE | IN_OPEN | IN_MOVED_FROM | \
IN_MOVED_TO | IN_DELETE | IN_CREATE | IN_DELETE_SELF | \
IN_MOVE_SELF)
/* Flags for sys_inotify_init1. */
#define IN_CLOEXEC O_CLOEXEC
#define IN_NONBLOCK O_NONBLOCK
/*
* ioctl numbers: inotify uses 'I' prefix for all ioctls,
* except historical FIONREAD, which is based on 'T'.
*
* INOTIFY_IOC_SETNEXTWD: set desired number of next created
* watch descriptor.
*/
#define INOTIFY_IOC_SETNEXTWD _IOW('I', 0, __s32)
#endif /* _UAPI_LINUX_INOTIFY_H */