linux-stable/include/uapi/linux/fuse.h

1190 lines
28 KiB
C
Raw Normal View History

License cleanup: add SPDX license identifier to uapi header files with a license Many user space API headers have licensing information, which is either incomplete, badly formatted or just a shorthand for referring to the license under which the file is supposed to be. This makes it hard for compliance tools to determine the correct license. Update these files with an SPDX license identifier. The identifier was chosen based on the license information in the file. GPL/LGPL licensed headers get the matching GPL/LGPL SPDX license identifier with the added 'WITH Linux-syscall-note' exception, which is the officially assigned exception identifier for the kernel syscall exception: NOTE! This copyright does *not* cover user programs that use kernel services by normal system calls - this is merely considered normal use of the kernel, and does *not* fall under the heading of "derived work". This exception makes it possible to include GPL headers into non GPL code, without confusing license compliance tools. Headers which have either explicit dual licensing or are just licensed under a non GPL license are updated with the corresponding SPDX identifier and the GPLv2 with syscall exception identifier. The format is: ((GPL-2.0 WITH Linux-syscall-note) OR SPDX-ID-OF-OTHER-LICENSE) SPDX license identifiers are a legally binding shorthand, which can be used instead of the full boiler plate text. The update does not remove existing license information as this has to be done on a case by case basis and the copyright holders might have to be consulted. This will happen in a separate step. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. See the previous patch in this series for the methodology of how this patch was researched. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 14:09:13 +00:00
/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) */
[PATCH] FUSE - core This patch adds FUSE core. This contains the following files: o inode.c - superblock operations (alloc_inode, destroy_inode, read_inode, clear_inode, put_super, show_options) - registers FUSE filesystem o fuse_i.h - private header file Requirements ============ The most important difference between orinary filesystems and FUSE is the fact, that the filesystem data/metadata is provided by a userspace process run with the privileges of the mount "owner" instead of the kernel, or some remote entity usually running with elevated privileges. The security implication of this is that a non-privileged user must not be able to use this capability to compromise the system. Obvious requirements arising from this are: - mount owner should not be able to get elevated privileges with the help of the mounted filesystem - mount owner should not be able to induce undesired behavior in other users' or the super user's processes - mount owner should not get illegitimate access to information from other users' and the super user's processes These are currently ensured with the following constraints: 1) mount is only allowed to directory or file which the mount owner can modify without limitation (write access + no sticky bit for directories) 2) nosuid,nodev mount options are forced 3) any process running with fsuid different from the owner is denied all access to the filesystem 1) and 2) are ensured by the "fusermount" mount utility which is a setuid root application doing the actual mount operation. 3) is ensured by a check in the permission() method in kernel I started thinking about doing 3) in a different way because Christoph H. made a big deal out of it, saying that FUSE is unacceptable into mainline in this form. The suggested use of private namespaces would be OK, but in their current form have many limitations that make their use impractical (as discussed in this thread). Suggested improvements that would address these limitations: - implement shared subtrees - allow a process to join an existing namespace (make namespaces first-class objects) - implement the namespace creation/joining in a PAM module With all that in place the check of owner against current->fsuid may be removed from the FUSE kernel module, without compromising the security requirements. Suid programs still interesting questions, since they get access even to the private namespace causing some information leak (exact order/timing of filesystem operations performed), giving some ptrace-like capabilities to unprivileged users. BTW this problem is not strictly limited to the namespace approach, since suid programs setting fsuid and accessing users' files will succeed with the current approach too. Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-09 20:10:26 +00:00
/*
This file defines the kernel interface of FUSE
Copyright (C) 2001-2008 Miklos Szeredi <miklos@szeredi.hu>
[PATCH] FUSE - core This patch adds FUSE core. This contains the following files: o inode.c - superblock operations (alloc_inode, destroy_inode, read_inode, clear_inode, put_super, show_options) - registers FUSE filesystem o fuse_i.h - private header file Requirements ============ The most important difference between orinary filesystems and FUSE is the fact, that the filesystem data/metadata is provided by a userspace process run with the privileges of the mount "owner" instead of the kernel, or some remote entity usually running with elevated privileges. The security implication of this is that a non-privileged user must not be able to use this capability to compromise the system. Obvious requirements arising from this are: - mount owner should not be able to get elevated privileges with the help of the mounted filesystem - mount owner should not be able to induce undesired behavior in other users' or the super user's processes - mount owner should not get illegitimate access to information from other users' and the super user's processes These are currently ensured with the following constraints: 1) mount is only allowed to directory or file which the mount owner can modify without limitation (write access + no sticky bit for directories) 2) nosuid,nodev mount options are forced 3) any process running with fsuid different from the owner is denied all access to the filesystem 1) and 2) are ensured by the "fusermount" mount utility which is a setuid root application doing the actual mount operation. 3) is ensured by a check in the permission() method in kernel I started thinking about doing 3) in a different way because Christoph H. made a big deal out of it, saying that FUSE is unacceptable into mainline in this form. The suggested use of private namespaces would be OK, but in their current form have many limitations that make their use impractical (as discussed in this thread). Suggested improvements that would address these limitations: - implement shared subtrees - allow a process to join an existing namespace (make namespaces first-class objects) - implement the namespace creation/joining in a PAM module With all that in place the check of owner against current->fsuid may be removed from the FUSE kernel module, without compromising the security requirements. Suid programs still interesting questions, since they get access even to the private namespace causing some information leak (exact order/timing of filesystem operations performed), giving some ptrace-like capabilities to unprivileged users. BTW this problem is not strictly limited to the namespace approach, since suid programs setting fsuid and accessing users' files will succeed with the current approach too. Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-09 20:10:26 +00:00
This program can be distributed under the terms of the GNU GPL.
See the file COPYING.
This -- and only this -- header file may also be distributed under
the terms of the BSD Licence as follows:
Copyright (C) 2001-2007 Miklos Szeredi. All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:
1. Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY AUTHOR AND CONTRIBUTORS ``AS IS'' AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL AUTHOR OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
SUCH DAMAGE.
[PATCH] FUSE - core This patch adds FUSE core. This contains the following files: o inode.c - superblock operations (alloc_inode, destroy_inode, read_inode, clear_inode, put_super, show_options) - registers FUSE filesystem o fuse_i.h - private header file Requirements ============ The most important difference between orinary filesystems and FUSE is the fact, that the filesystem data/metadata is provided by a userspace process run with the privileges of the mount "owner" instead of the kernel, or some remote entity usually running with elevated privileges. The security implication of this is that a non-privileged user must not be able to use this capability to compromise the system. Obvious requirements arising from this are: - mount owner should not be able to get elevated privileges with the help of the mounted filesystem - mount owner should not be able to induce undesired behavior in other users' or the super user's processes - mount owner should not get illegitimate access to information from other users' and the super user's processes These are currently ensured with the following constraints: 1) mount is only allowed to directory or file which the mount owner can modify without limitation (write access + no sticky bit for directories) 2) nosuid,nodev mount options are forced 3) any process running with fsuid different from the owner is denied all access to the filesystem 1) and 2) are ensured by the "fusermount" mount utility which is a setuid root application doing the actual mount operation. 3) is ensured by a check in the permission() method in kernel I started thinking about doing 3) in a different way because Christoph H. made a big deal out of it, saying that FUSE is unacceptable into mainline in this form. The suggested use of private namespaces would be OK, but in their current form have many limitations that make their use impractical (as discussed in this thread). Suggested improvements that would address these limitations: - implement shared subtrees - allow a process to join an existing namespace (make namespaces first-class objects) - implement the namespace creation/joining in a PAM module With all that in place the check of owner against current->fsuid may be removed from the FUSE kernel module, without compromising the security requirements. Suid programs still interesting questions, since they get access even to the private namespace causing some information leak (exact order/timing of filesystem operations performed), giving some ptrace-like capabilities to unprivileged users. BTW this problem is not strictly limited to the namespace approach, since suid programs setting fsuid and accessing users' files will succeed with the current approach too. Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-09 20:10:26 +00:00
*/
/*
* This file defines the kernel interface of FUSE
*
* Protocol changelog:
*
* 7.1:
* - add the following messages:
* FUSE_SETATTR, FUSE_SYMLINK, FUSE_MKNOD, FUSE_MKDIR, FUSE_UNLINK,
* FUSE_RMDIR, FUSE_RENAME, FUSE_LINK, FUSE_OPEN, FUSE_READ, FUSE_WRITE,
* FUSE_RELEASE, FUSE_FSYNC, FUSE_FLUSH, FUSE_SETXATTR, FUSE_GETXATTR,
* FUSE_LISTXATTR, FUSE_REMOVEXATTR, FUSE_OPENDIR, FUSE_READDIR,
* FUSE_RELEASEDIR
* - add padding to messages to accommodate 32-bit servers on 64-bit kernels
*
* 7.2:
* - add FOPEN_DIRECT_IO and FOPEN_KEEP_CACHE flags
* - add FUSE_FSYNCDIR message
*
* 7.3:
* - add FUSE_ACCESS message
* - add FUSE_CREATE message
* - add filehandle to fuse_setattr_in
*
* 7.4:
* - add frsize to fuse_kstatfs
* - clean up request size limit checking
*
* 7.5:
* - add flags and max_write to fuse_init_out
*
* 7.6:
* - add max_readahead to fuse_init_in and fuse_init_out
*
* 7.7:
* - add FUSE_INTERRUPT message
* - add POSIX file lock support
*
* 7.8:
* - add lock_owner and flags fields to fuse_release_in
* - add FUSE_BMAP message
* - add FUSE_DESTROY message
*
* 7.9:
* - new fuse_getattr_in input argument of GETATTR
* - add lk_flags in fuse_lk_in
* - add lock_owner field to fuse_setattr_in, fuse_read_in and fuse_write_in
* - add blksize field to fuse_attr
* - add file flags field to fuse_read_in and fuse_write_in
* - Add ATIME_NOW and MTIME_NOW flags to fuse_setattr_in
*
* 7.10
* - add nonseekable open flag
*
* 7.11
* - add IOCTL message
* - add unsolicited notification support
* - add POLL message and NOTIFY_POLL notification
*
* 7.12
* - add umask flag to input argument of create, mknod and mkdir
* - add notification messages for invalidation of inodes and
* directory entries
*
* 7.13
* - make max number of background requests and congestion threshold
* tunables
fuse: support splice() writing to fuse device Allow userspace filesystem implementation to use splice() to write to the fuse device. The semantics of using splice() are: 1) buffer the message header and data in a temporary pipe 2) with a *single* splice() call move the message from the temporary pipe to the fuse device The READ reply message has the most interesting use for this, since now the data from an arbitrary file descriptor (which could be a regular file, a block device or a socket) can be tranferred into the fuse device without having to go through a userspace buffer. It will also allow zero copy moving of pages. One caveat is that the protocol on the fuse device requires the length of the whole message to be written into the header. But the length of the data transferred into the temporary pipe may not be known in advance. The current library implementation works around this by using vmplice to write the header and modifying the header after splicing the data into the pipe (error handling omitted): struct fuse_out_header out; iov.iov_base = &out; iov.iov_len = sizeof(struct fuse_out_header); vmsplice(pip[1], &iov, 1, 0); len = splice(input_fd, input_offset, pip[1], NULL, len, 0); /* retrospectively modify the header: */ out.len = len + sizeof(struct fuse_out_header); splice(pip[0], NULL, fuse_chan_fd(req->ch), NULL, out.len, flags); This works since vmsplice only saves a pointer to the data, it does not copy the data itself. Since pipes are currently limited to 16 pages and messages need to be spliced atomically, the length of the data is limited to 15 pages (or 60kB for 4k pages). Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2010-05-25 13:06:06 +00:00
*
* 7.14
* - add splice support to fuse device
*
* 7.15
* - add store notify
* - add retrieve notify
*
* 7.16
* - add BATCH_FORGET request
* - FUSE_IOCTL_UNRESTRICTED shall now return with array of 'struct
* fuse_ioctl_iovec' instead of ambiguous 'struct iovec'
* - add FUSE_IOCTL_32BIT flag
*
* 7.17
* - add FUSE_FLOCK_LOCKS and FUSE_RELEASE_FLOCK_UNLOCK
*
* 7.18
* - add FUSE_IOCTL_DIR flag
FUSE: Notifying the kernel of deletion. Allows a FUSE file-system to tell the kernel when a file or directory is deleted. If the specified dentry has the specified inode number, the kernel will unhash it. The current 'fuse_notify_inval_entry' does not cause the kernel to clean up directories that are in use properly, and as a result the users of those directories see incorrect semantics from the file-system. The error condition seen when 'fuse_notify_inval_entry' is used to notify of a deleted directory is avoided when 'fuse_notify_delete' is used instead. The following scenario demonstrates the difference: 1. User A chdirs into 'testdir' and starts reading 'testfile'. 2. User B rm -rf 'testdir'. 3. User B creates 'testdir'. 4. User C chdirs into 'testdir'. If you run the above within the same machine on any file-system (including fuse file-systems), there is no problem: user C is able to chdir into the new testdir. The old testdir is removed from the dentry tree, but still open by user A. If operations 2 and 3 are performed via the network such that the fuse file-system uses one of the notify functions to tell the kernel that the nodes are gone, then the following error occurs for user C while user A holds the original directory open: muirj@empacher:~> ls /test/testdir ls: cannot access /test/testdir: No such file or directory The issue here is that the kernel still has a dentry for testdir, and so it is requesting the attributes for the old directory, while the file-system is responding that the directory no longer exists. If on the other hand, if the file-system can notify the kernel that the directory is deleted using the new 'fuse_notify_delete' function, then the above ls will find the new directory as expected. Signed-off-by: John Muir <john@jmuir.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2011-12-06 20:50:06 +00:00
* - add FUSE_NOTIFY_DELETE
*
* 7.19
* - add FUSE_FALLOCATE
*
* 7.20
* - add FUSE_AUTO_INVAL_DATA
*
* 7.21
* - add FUSE_READDIRPLUS
* - send the requested events in POLL request
*
* 7.22
* - add FUSE_ASYNC_DIO
*
* 7.23
* - add FUSE_WRITEBACK_CACHE
* - add time_gran to fuse_init_out
* - add reserved space to fuse_init_out
* - add FATTR_CTIME
* - add ctime and ctimensec to fuse_setattr_in
* - add FUSE_RENAME2 request
* - add FUSE_NO_OPEN_SUPPORT flag
*
* 7.24
* - add FUSE_LSEEK for SEEK_HOLE and SEEK_DATA support
*
* 7.25
* - add FUSE_PARALLEL_DIROPS
*
* 7.26
* - add FUSE_HANDLE_KILLPRIV
* - add FUSE_POSIX_ACL
*
* 7.27
* - add FUSE_ABORT_ERROR
*
* 7.28
* - add FUSE_COPY_FILE_RANGE
* - add FOPEN_CACHE_DIR
fuse: add max_pages to init_out Replace FUSE_MAX_PAGES_PER_REQ with the configurable parameter max_pages to improve performance. Old RFC with detailed description of the problem and many fixes by Mitsuo Hayasaka (mitsuo.hayasaka.hu@hitachi.com): - https://lkml.org/lkml/2012/7/5/136 We've encountered performance degradation and fixed it on a big and complex virtual environment. Environment to reproduce degradation and improvement: 1. Add lag to user mode FUSE Add nanosleep(&(struct timespec){ 0, 1000 }, NULL); to xmp_write_buf in passthrough_fh.c 2. patch UM fuse with configurable max_pages parameter. The patch will be provided latter. 3. run test script and perform test on tmpfs fuse_test() { cd /tmp mkdir -p fusemnt passthrough_fh -o max_pages=$1 /tmp/fusemnt grep fuse /proc/self/mounts dd conv=fdatasync oflag=dsync if=/dev/zero of=fusemnt/tmp/tmp \ count=1K bs=1M 2>&1 | grep -v records rm fusemnt/tmp/tmp killall passthrough_fh } Test results: passthrough_fh /tmp/fusemnt fuse.passthrough_fh \ rw,nosuid,nodev,relatime,user_id=0,group_id=0 0 0 1073741824 bytes (1.1 GB) copied, 1.73867 s, 618 MB/s passthrough_fh /tmp/fusemnt fuse.passthrough_fh \ rw,nosuid,nodev,relatime,user_id=0,group_id=0,max_pages=256 0 0 1073741824 bytes (1.1 GB) copied, 1.15643 s, 928 MB/s Obviously with bigger lag the difference between 'before' and 'after' will be more significant. Mitsuo Hayasaka, in 2012 (https://lkml.org/lkml/2012/7/5/136), observed improvement from 400-550 to 520-740. Signed-off-by: Constantine Shulyupin <const@MakeLinux.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-09-06 12:37:06 +00:00
* - add FUSE_MAX_PAGES, add max_pages to init_out
* - add FUSE_CACHE_SYMLINKS
*
* 7.29
* - add FUSE_NO_OPENDIR_SUPPORT flag
fuse: allow filesystems to have precise control over data cache On networked filesystems file data can be changed externally. FUSE provides notification messages for filesystem to inform kernel that metadata or data region of a file needs to be invalidated in local page cache. That provides the basis for filesystem implementations to invalidate kernel cache explicitly based on observed filesystem-specific events. FUSE has also "automatic" invalidation mode(*) when the kernel automatically invalidates data cache of a file if it sees mtime change. It also automatically invalidates whole data cache of a file if it sees file size being changed. The automatic mode has corresponding capability - FUSE_AUTO_INVAL_DATA. However, due to probably historical reason, that capability controls only whether mtime change should be resulting in automatic invalidation or not. A change in file size always results in invalidating whole data cache of a file irregardless of whether FUSE_AUTO_INVAL_DATA was negotiated(+). The filesystem I write[1] represents data arrays stored in networked database as local files suitable for mmap. It is read-only filesystem - changes to data are committed externally via database interfaces and the filesystem only glues data into contiguous file streams suitable for mmap and traditional array processing. The files are big - starting from hundreds gigabytes and more. The files change regularly, and frequently by data being appended to their end. The size of files thus changes frequently. If a file was accessed locally and some part of its data got into page cache, we want that data to stay cached unless there is memory pressure, or unless corresponding part of the file was actually changed. However current FUSE behaviour - when it sees file size change - is to invalidate the whole file. The data cache of the file is thus completely lost even on small size change, and despite that the filesystem server is careful to accurately translate database changes into FUSE invalidation messages to kernel. Let's fix it: if a filesystem, through new FUSE_EXPLICIT_INVAL_DATA capability, indicates to kernel that it is fully responsible for data cache invalidation, then the kernel won't invalidate files data cache on size change and only truncate that cache to new size in case the size decreased. (*) see 72d0d248ca "fuse: add FUSE_AUTO_INVAL_DATA init flag", eed2179efe "fuse: invalidate inode mapping if mtime changes" (+) in writeback mode the kernel does not invalidate data cache on file size change, but neither it allows the filesystem to set the size due to external event (see 8373200b12 "fuse: Trust kernel i_size only") [1] https://lab.nexedi.com/kirr/wendelin.core/blob/a50f1d9f/wcfs/wcfs.go#L20 Signed-off-by: Kirill Smelkov <kirr@nexedi.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-03-27 11:14:15 +00:00
*
* 7.30
* - add FUSE_EXPLICIT_INVAL_DATA
* - add FUSE_IOCTL_COMPAT_X32
*
* 7.31
* - add FUSE_WRITE_KILL_PRIV flag
* - add FUSE_SETUPMAPPING and FUSE_REMOVEMAPPING
* - add map_alignment to fuse_init_out, add FUSE_MAP_ALIGNMENT flag
*
* 7.32
* - add flags to fuse_attr, add FUSE_ATTR_SUBMOUNT, add FUSE_SUBMOUNTS
*
* 7.33
* - add FUSE_HANDLE_KILLPRIV_V2, FUSE_WRITE_KILL_SUIDGID, FATTR_KILL_SUIDGID
* - add FUSE_OPEN_KILL_SUIDGID
* - extend fuse_setxattr_in, add FUSE_SETXATTR_EXT
* - add FUSE_SETXATTR_ACL_KILL_SGID
virtiofs: propagate sync() to file server Even if POSIX doesn't mandate it, linux users legitimately expect sync() to flush all data and metadata to physical storage when it is located on the same system. This isn't happening with virtiofs though: sync() inside the guest returns right away even though data still needs to be flushed from the host page cache. This is easily demonstrated by doing the following in the guest: $ dd if=/dev/zero of=/mnt/foo bs=1M count=5K ; strace -T -e sync sync 5120+0 records in 5120+0 records out 5368709120 bytes (5.4 GB, 5.0 GiB) copied, 5.22224 s, 1.0 GB/s sync() = 0 <0.024068> and start the following in the host when the 'dd' command completes in the guest: $ strace -T -e fsync /usr/bin/sync virtiofs/foo fsync(3) = 0 <10.371640> There are no good reasons not to honor the expected behavior of sync() actually: it gives an unrealistic impression that virtiofs is super fast and that data has safely landed on HW, which isn't the case obviously. Implement a ->sync_fs() superblock operation that sends a new FUSE_SYNCFS request type for this purpose. Provision a 64-bit placeholder for possible future extensions. Since the file server cannot handle the wait == 0 case, we skip it to avoid a gratuitous roundtrip. Note that this is per-superblock: a FUSE_SYNCFS is send for the root mount and for each submount. Like with FUSE_FSYNC and FUSE_FSYNCDIR, lack of support for FUSE_SYNCFS in the file server is treated as permanent success. This ensures compatibility with older file servers: the client will get the current behavior of sync() not being propagated to the file server. Note that such an operation allows the file server to DoS sync(). Since a typical FUSE file server is an untrusted piece of software running in userspace, this is disabled by default. Only enable it with virtiofs for now since virtiofsd is supposedly trusted by the guest kernel. Reported-by: Robert Krawitz <rlk@redhat.com> Signed-off-by: Greg Kurz <groug@kaod.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-05-20 15:46:54 +00:00
*
* 7.34
* - add FUSE_SYNCFS
*
* 7.35
* - add FOPEN_NOFLUSH
*
* 7.36
* - extend fuse_init_in with reserved fields, add FUSE_INIT_EXT init flag
* - add flags2 to fuse_init_in and fuse_init_out
fuse: send security context of inode on file When a new inode is created, send its security context to server along with creation request (FUSE_CREAT, FUSE_MKNOD, FUSE_MKDIR and FUSE_SYMLINK). This gives server an opportunity to create new file and set security context (possibly atomically). In all the configurations it might not be possible to set context atomically. Like nfs and ceph, use security_dentry_init_security() to dermine security context of inode and send it with create, mkdir, mknod, and symlink requests. Following is the information sent to server. fuse_sectx_header, fuse_secctx, xattr_name, security_context - struct fuse_secctx_header This contains total number of security contexts being sent and total size of all the security contexts (including size of fuse_secctx_header). - struct fuse_secctx This contains size of security context which follows this structure. There is one fuse_secctx instance per security context. - xattr name string This string represents name of xattr which should be used while setting security context. - security context This is the actual security context whose size is specified in fuse_secctx struct. Also add the FUSE_SECURITY_CTX flag for the `flags` field of the fuse_init_out struct. When this flag is set the kernel will append the security context for a newly created inode to the request (create, mkdir, mknod, and symlink). The server is responsible for ensuring that the inode appears atomically (preferrably) with the requested security context. For example, If the server is using SELinux and backed by a "real" linux file system that supports extended attributes it can write the security context value to /proc/thread-self/attr/fscreate before making the syscall to create the inode. This patch is based on patch from Chirantan Ekbote <chirantan@chromium.org> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-11-11 14:32:49 +00:00
* - add FUSE_SECURITY_CTX init flag
* - add security context to create, mkdir, symlink, and mknod requests
* - add FUSE_HAS_INODE_DAX, FUSE_ATTR_DAX
*
* 7.37
* - add FUSE_TMPFILE
*
* 7.38
* - add FUSE_EXPIRE_ONLY flag to fuse_notify_inval_entry
fuse: allow non-extending parallel direct writes on the same file In general, as of now, in FUSE, direct writes on the same file are serialized over inode lock i.e we hold inode lock for the full duration of the write request. I could not find in fuse code and git history a comment which clearly explains why this exclusive lock is taken for direct writes. Following might be the reasons for acquiring an exclusive lock but not be limited to 1) Our guess is some USER space fuse implementations might be relying on this lock for serialization. 2) The lock protects against file read/write size races. 3) Ruling out any issues arising from partial write failures. This patch relaxes the exclusive lock for direct non-extending writes only. File size extending writes might not need the lock either, but we are not entirely sure if there is a risk to introduce any kind of regression. Furthermore, benchmarking with fio does not show a difference between patch versions that take on file size extension a) an exclusive lock and b) a shared lock. A possible example of an issue with i_size extending writes are write error cases. Some writes might succeed and others might fail for file system internal reasons - for example ENOSPACE. With parallel file size extending writes it _might_ be difficult to revert the action of the failing write, especially to restore the right i_size. With these changes, we allow non-extending parallel direct writes on the same file with the help of a flag called FOPEN_PARALLEL_DIRECT_WRITES. If this flag is set on the file (flag is passed from libfuse to fuse kernel as part of file open/create), we do not take exclusive lock anymore, but instead use a shared lock that allows non-extending writes to run in parallel. FUSE implementations which rely on this inode lock for serialization can continue to do so and serialized direct writes are still the default. Implementations that do not do write serialization need to be updated and need to set the FOPEN_PARALLEL_DIRECT_WRITES flag in their file open/create reply. On patch review there were concerns that network file systems (or vfs multiple mounts of the same file system) might have issues with parallel writes. We believe this is not the case, as this is just a local lock, which network file systems could not rely on anyway. I.e. this lock is just for local consistency. Signed-off-by: Dharmendra Singh <dsingh@ddn.com> Signed-off-by: Bernd Schubert <bschubert@ddn.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-06-17 07:10:27 +00:00
* - add FOPEN_PARALLEL_DIRECT_WRITES
* - add total_extlen to fuse_in_header
* - add FUSE_MAX_NR_SECCTX
* - add extension header
* - add FUSE_EXT_GROUPS
* - add FUSE_CREATE_SUPP_GROUP
* - add FUSE_HAS_EXPIRE_ONLY
*
* 7.39
* - add FUSE_DIRECT_IO_ALLOW_MMAP
* - add FUSE_STATX and related structures
*
* 7.40
* - add max_stack_depth to fuse_init_out, add FUSE_PASSTHROUGH init flag
* - add backing_id to fuse_open_out, add FOPEN_PASSTHROUGH open flag
fuse: add support for explicit export disabling open_by_handle_at(2) can fail with -ESTALE with a valid handle returned by a previous name_to_handle_at(2) for evicted fuse inodes, which is especially common when entry_valid_timeout is 0, e.g. when the fuse daemon is in "cache=none" mode. The time sequence is like: name_to_handle_at(2) # succeed evict fuse inode open_by_handle_at(2) # fail The root cause is that, with 0 entry_valid_timeout, the dput() called in name_to_handle_at(2) will trigger iput -> evict(), which will send FUSE_FORGET to the daemon. The following open_by_handle_at(2) will send a new FUSE_LOOKUP request upon inode cache miss since the previous inode eviction. Then the fuse daemon may fail the FUSE_LOOKUP request with -ENOENT as the cached metadata of the requested inode has already been cleaned up during the previous FUSE_FORGET. The returned -ENOENT is treated as -ESTALE when open_by_handle_at(2) returns. This confuses the application somehow, as open_by_handle_at(2) fails when the previous name_to_handle_at(2) succeeds. The returned errno is also confusing as the requested file is not deleted and already there. It is reasonable to fail name_to_handle_at(2) early in this case, after which the application can fallback to open(2) to access files. Since this issue typically appears when entry_valid_timeout is 0 which is configured by the fuse daemon, the fuse daemon is the right person to explicitly disable the export when required. Also considering FUSE_EXPORT_SUPPORT actually indicates the support for lookups of "." and "..", and there are existing fuse daemons supporting export without FUSE_EXPORT_SUPPORT set, for compatibility, we add a new INIT flag for such purpose. Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-02-26 03:54:35 +00:00
* - add FUSE_NO_EXPORT_SUPPORT init flag
* - add FUSE_NOTIFY_RESEND, add FUSE_HAS_RESEND init flag
*/
[PATCH] FUSE - core This patch adds FUSE core. This contains the following files: o inode.c - superblock operations (alloc_inode, destroy_inode, read_inode, clear_inode, put_super, show_options) - registers FUSE filesystem o fuse_i.h - private header file Requirements ============ The most important difference between orinary filesystems and FUSE is the fact, that the filesystem data/metadata is provided by a userspace process run with the privileges of the mount "owner" instead of the kernel, or some remote entity usually running with elevated privileges. The security implication of this is that a non-privileged user must not be able to use this capability to compromise the system. Obvious requirements arising from this are: - mount owner should not be able to get elevated privileges with the help of the mounted filesystem - mount owner should not be able to induce undesired behavior in other users' or the super user's processes - mount owner should not get illegitimate access to information from other users' and the super user's processes These are currently ensured with the following constraints: 1) mount is only allowed to directory or file which the mount owner can modify without limitation (write access + no sticky bit for directories) 2) nosuid,nodev mount options are forced 3) any process running with fsuid different from the owner is denied all access to the filesystem 1) and 2) are ensured by the "fusermount" mount utility which is a setuid root application doing the actual mount operation. 3) is ensured by a check in the permission() method in kernel I started thinking about doing 3) in a different way because Christoph H. made a big deal out of it, saying that FUSE is unacceptable into mainline in this form. The suggested use of private namespaces would be OK, but in their current form have many limitations that make their use impractical (as discussed in this thread). Suggested improvements that would address these limitations: - implement shared subtrees - allow a process to join an existing namespace (make namespaces first-class objects) - implement the namespace creation/joining in a PAM module With all that in place the check of owner against current->fsuid may be removed from the FUSE kernel module, without compromising the security requirements. Suid programs still interesting questions, since they get access even to the private namespace causing some information leak (exact order/timing of filesystem operations performed), giving some ptrace-like capabilities to unprivileged users. BTW this problem is not strictly limited to the namespace approach, since suid programs setting fsuid and accessing users' files will succeed with the current approach too. Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-09 20:10:26 +00:00
#ifndef _LINUX_FUSE_H
#define _LINUX_FUSE_H
#ifdef __KERNEL__
#include <linux/types.h>
#else
#include <stdint.h>
#endif
[PATCH] FUSE - core This patch adds FUSE core. This contains the following files: o inode.c - superblock operations (alloc_inode, destroy_inode, read_inode, clear_inode, put_super, show_options) - registers FUSE filesystem o fuse_i.h - private header file Requirements ============ The most important difference between orinary filesystems and FUSE is the fact, that the filesystem data/metadata is provided by a userspace process run with the privileges of the mount "owner" instead of the kernel, or some remote entity usually running with elevated privileges. The security implication of this is that a non-privileged user must not be able to use this capability to compromise the system. Obvious requirements arising from this are: - mount owner should not be able to get elevated privileges with the help of the mounted filesystem - mount owner should not be able to induce undesired behavior in other users' or the super user's processes - mount owner should not get illegitimate access to information from other users' and the super user's processes These are currently ensured with the following constraints: 1) mount is only allowed to directory or file which the mount owner can modify without limitation (write access + no sticky bit for directories) 2) nosuid,nodev mount options are forced 3) any process running with fsuid different from the owner is denied all access to the filesystem 1) and 2) are ensured by the "fusermount" mount utility which is a setuid root application doing the actual mount operation. 3) is ensured by a check in the permission() method in kernel I started thinking about doing 3) in a different way because Christoph H. made a big deal out of it, saying that FUSE is unacceptable into mainline in this form. The suggested use of private namespaces would be OK, but in their current form have many limitations that make their use impractical (as discussed in this thread). Suggested improvements that would address these limitations: - implement shared subtrees - allow a process to join an existing namespace (make namespaces first-class objects) - implement the namespace creation/joining in a PAM module With all that in place the check of owner against current->fsuid may be removed from the FUSE kernel module, without compromising the security requirements. Suid programs still interesting questions, since they get access even to the private namespace causing some information leak (exact order/timing of filesystem operations performed), giving some ptrace-like capabilities to unprivileged users. BTW this problem is not strictly limited to the namespace approach, since suid programs setting fsuid and accessing users' files will succeed with the current approach too. Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-09 20:10:26 +00:00
/*
* Version negotiation:
*
* Both the kernel and userspace send the version they support in the
* INIT request and reply respectively.
*
* If the major versions match then both shall use the smallest
* of the two minor versions for communication.
*
* If the kernel supports a larger major version, then userspace shall
* reply with the major version it supports, ignore the rest of the
* INIT message and expect a new INIT message from the kernel with a
* matching major version.
*
* If the library supports a larger major version, then it shall fall
* back to the major protocol version sent by the kernel for
* communication and reply with that major version (and an arbitrary
* supported minor version).
*/
[PATCH] FUSE - core This patch adds FUSE core. This contains the following files: o inode.c - superblock operations (alloc_inode, destroy_inode, read_inode, clear_inode, put_super, show_options) - registers FUSE filesystem o fuse_i.h - private header file Requirements ============ The most important difference between orinary filesystems and FUSE is the fact, that the filesystem data/metadata is provided by a userspace process run with the privileges of the mount "owner" instead of the kernel, or some remote entity usually running with elevated privileges. The security implication of this is that a non-privileged user must not be able to use this capability to compromise the system. Obvious requirements arising from this are: - mount owner should not be able to get elevated privileges with the help of the mounted filesystem - mount owner should not be able to induce undesired behavior in other users' or the super user's processes - mount owner should not get illegitimate access to information from other users' and the super user's processes These are currently ensured with the following constraints: 1) mount is only allowed to directory or file which the mount owner can modify without limitation (write access + no sticky bit for directories) 2) nosuid,nodev mount options are forced 3) any process running with fsuid different from the owner is denied all access to the filesystem 1) and 2) are ensured by the "fusermount" mount utility which is a setuid root application doing the actual mount operation. 3) is ensured by a check in the permission() method in kernel I started thinking about doing 3) in a different way because Christoph H. made a big deal out of it, saying that FUSE is unacceptable into mainline in this form. The suggested use of private namespaces would be OK, but in their current form have many limitations that make their use impractical (as discussed in this thread). Suggested improvements that would address these limitations: - implement shared subtrees - allow a process to join an existing namespace (make namespaces first-class objects) - implement the namespace creation/joining in a PAM module With all that in place the check of owner against current->fsuid may be removed from the FUSE kernel module, without compromising the security requirements. Suid programs still interesting questions, since they get access even to the private namespace causing some information leak (exact order/timing of filesystem operations performed), giving some ptrace-like capabilities to unprivileged users. BTW this problem is not strictly limited to the namespace approach, since suid programs setting fsuid and accessing users' files will succeed with the current approach too. Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-09 20:10:26 +00:00
/** Version number of this interface */
#define FUSE_KERNEL_VERSION 7
[PATCH] FUSE - core This patch adds FUSE core. This contains the following files: o inode.c - superblock operations (alloc_inode, destroy_inode, read_inode, clear_inode, put_super, show_options) - registers FUSE filesystem o fuse_i.h - private header file Requirements ============ The most important difference between orinary filesystems and FUSE is the fact, that the filesystem data/metadata is provided by a userspace process run with the privileges of the mount "owner" instead of the kernel, or some remote entity usually running with elevated privileges. The security implication of this is that a non-privileged user must not be able to use this capability to compromise the system. Obvious requirements arising from this are: - mount owner should not be able to get elevated privileges with the help of the mounted filesystem - mount owner should not be able to induce undesired behavior in other users' or the super user's processes - mount owner should not get illegitimate access to information from other users' and the super user's processes These are currently ensured with the following constraints: 1) mount is only allowed to directory or file which the mount owner can modify without limitation (write access + no sticky bit for directories) 2) nosuid,nodev mount options are forced 3) any process running with fsuid different from the owner is denied all access to the filesystem 1) and 2) are ensured by the "fusermount" mount utility which is a setuid root application doing the actual mount operation. 3) is ensured by a check in the permission() method in kernel I started thinking about doing 3) in a different way because Christoph H. made a big deal out of it, saying that FUSE is unacceptable into mainline in this form. The suggested use of private namespaces would be OK, but in their current form have many limitations that make their use impractical (as discussed in this thread). Suggested improvements that would address these limitations: - implement shared subtrees - allow a process to join an existing namespace (make namespaces first-class objects) - implement the namespace creation/joining in a PAM module With all that in place the check of owner against current->fsuid may be removed from the FUSE kernel module, without compromising the security requirements. Suid programs still interesting questions, since they get access even to the private namespace causing some information leak (exact order/timing of filesystem operations performed), giving some ptrace-like capabilities to unprivileged users. BTW this problem is not strictly limited to the namespace approach, since suid programs setting fsuid and accessing users' files will succeed with the current approach too. Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-09 20:10:26 +00:00
/** Minor version number of this interface */
#define FUSE_KERNEL_MINOR_VERSION 40
[PATCH] FUSE - core This patch adds FUSE core. This contains the following files: o inode.c - superblock operations (alloc_inode, destroy_inode, read_inode, clear_inode, put_super, show_options) - registers FUSE filesystem o fuse_i.h - private header file Requirements ============ The most important difference between orinary filesystems and FUSE is the fact, that the filesystem data/metadata is provided by a userspace process run with the privileges of the mount "owner" instead of the kernel, or some remote entity usually running with elevated privileges. The security implication of this is that a non-privileged user must not be able to use this capability to compromise the system. Obvious requirements arising from this are: - mount owner should not be able to get elevated privileges with the help of the mounted filesystem - mount owner should not be able to induce undesired behavior in other users' or the super user's processes - mount owner should not get illegitimate access to information from other users' and the super user's processes These are currently ensured with the following constraints: 1) mount is only allowed to directory or file which the mount owner can modify without limitation (write access + no sticky bit for directories) 2) nosuid,nodev mount options are forced 3) any process running with fsuid different from the owner is denied all access to the filesystem 1) and 2) are ensured by the "fusermount" mount utility which is a setuid root application doing the actual mount operation. 3) is ensured by a check in the permission() method in kernel I started thinking about doing 3) in a different way because Christoph H. made a big deal out of it, saying that FUSE is unacceptable into mainline in this form. The suggested use of private namespaces would be OK, but in their current form have many limitations that make their use impractical (as discussed in this thread). Suggested improvements that would address these limitations: - implement shared subtrees - allow a process to join an existing namespace (make namespaces first-class objects) - implement the namespace creation/joining in a PAM module With all that in place the check of owner against current->fsuid may be removed from the FUSE kernel module, without compromising the security requirements. Suid programs still interesting questions, since they get access even to the private namespace causing some information leak (exact order/timing of filesystem operations performed), giving some ptrace-like capabilities to unprivileged users. BTW this problem is not strictly limited to the namespace approach, since suid programs setting fsuid and accessing users' files will succeed with the current approach too. Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-09 20:10:26 +00:00
/** The node ID of the root inode */
#define FUSE_ROOT_ID 1
/* Make sure all structures are padded to 64bit boundary, so 32bit
userspace works under 64bit kernels */
[PATCH] FUSE - core This patch adds FUSE core. This contains the following files: o inode.c - superblock operations (alloc_inode, destroy_inode, read_inode, clear_inode, put_super, show_options) - registers FUSE filesystem o fuse_i.h - private header file Requirements ============ The most important difference between orinary filesystems and FUSE is the fact, that the filesystem data/metadata is provided by a userspace process run with the privileges of the mount "owner" instead of the kernel, or some remote entity usually running with elevated privileges. The security implication of this is that a non-privileged user must not be able to use this capability to compromise the system. Obvious requirements arising from this are: - mount owner should not be able to get elevated privileges with the help of the mounted filesystem - mount owner should not be able to induce undesired behavior in other users' or the super user's processes - mount owner should not get illegitimate access to information from other users' and the super user's processes These are currently ensured with the following constraints: 1) mount is only allowed to directory or file which the mount owner can modify without limitation (write access + no sticky bit for directories) 2) nosuid,nodev mount options are forced 3) any process running with fsuid different from the owner is denied all access to the filesystem 1) and 2) are ensured by the "fusermount" mount utility which is a setuid root application doing the actual mount operation. 3) is ensured by a check in the permission() method in kernel I started thinking about doing 3) in a different way because Christoph H. made a big deal out of it, saying that FUSE is unacceptable into mainline in this form. The suggested use of private namespaces would be OK, but in their current form have many limitations that make their use impractical (as discussed in this thread). Suggested improvements that would address these limitations: - implement shared subtrees - allow a process to join an existing namespace (make namespaces first-class objects) - implement the namespace creation/joining in a PAM module With all that in place the check of owner against current->fsuid may be removed from the FUSE kernel module, without compromising the security requirements. Suid programs still interesting questions, since they get access even to the private namespace causing some information leak (exact order/timing of filesystem operations performed), giving some ptrace-like capabilities to unprivileged users. BTW this problem is not strictly limited to the namespace approach, since suid programs setting fsuid and accessing users' files will succeed with the current approach too. Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-09 20:10:26 +00:00
struct fuse_attr {
uint64_t ino;
uint64_t size;
uint64_t blocks;
uint64_t atime;
uint64_t mtime;
uint64_t ctime;
uint32_t atimensec;
uint32_t mtimensec;
uint32_t ctimensec;
uint32_t mode;
uint32_t nlink;
uint32_t uid;
uint32_t gid;
uint32_t rdev;
uint32_t blksize;
uint32_t flags;
[PATCH] FUSE - core This patch adds FUSE core. This contains the following files: o inode.c - superblock operations (alloc_inode, destroy_inode, read_inode, clear_inode, put_super, show_options) - registers FUSE filesystem o fuse_i.h - private header file Requirements ============ The most important difference between orinary filesystems and FUSE is the fact, that the filesystem data/metadata is provided by a userspace process run with the privileges of the mount "owner" instead of the kernel, or some remote entity usually running with elevated privileges. The security implication of this is that a non-privileged user must not be able to use this capability to compromise the system. Obvious requirements arising from this are: - mount owner should not be able to get elevated privileges with the help of the mounted filesystem - mount owner should not be able to induce undesired behavior in other users' or the super user's processes - mount owner should not get illegitimate access to information from other users' and the super user's processes These are currently ensured with the following constraints: 1) mount is only allowed to directory or file which the mount owner can modify without limitation (write access + no sticky bit for directories) 2) nosuid,nodev mount options are forced 3) any process running with fsuid different from the owner is denied all access to the filesystem 1) and 2) are ensured by the "fusermount" mount utility which is a setuid root application doing the actual mount operation. 3) is ensured by a check in the permission() method in kernel I started thinking about doing 3) in a different way because Christoph H. made a big deal out of it, saying that FUSE is unacceptable into mainline in this form. The suggested use of private namespaces would be OK, but in their current form have many limitations that make their use impractical (as discussed in this thread). Suggested improvements that would address these limitations: - implement shared subtrees - allow a process to join an existing namespace (make namespaces first-class objects) - implement the namespace creation/joining in a PAM module With all that in place the check of owner against current->fsuid may be removed from the FUSE kernel module, without compromising the security requirements. Suid programs still interesting questions, since they get access even to the private namespace causing some information leak (exact order/timing of filesystem operations performed), giving some ptrace-like capabilities to unprivileged users. BTW this problem is not strictly limited to the namespace approach, since suid programs setting fsuid and accessing users' files will succeed with the current approach too. Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-09 20:10:26 +00:00
};
/*
* The following structures are bit-for-bit compatible with the statx(2) ABI in
* Linux.
*/
struct fuse_sx_time {
int64_t tv_sec;
uint32_t tv_nsec;
int32_t __reserved;
};
struct fuse_statx {
uint32_t mask;
uint32_t blksize;
uint64_t attributes;
uint32_t nlink;
uint32_t uid;
uint32_t gid;
uint16_t mode;
uint16_t __spare0[1];
uint64_t ino;
uint64_t size;
uint64_t blocks;
uint64_t attributes_mask;
struct fuse_sx_time atime;
struct fuse_sx_time btime;
struct fuse_sx_time ctime;
struct fuse_sx_time mtime;
uint32_t rdev_major;
uint32_t rdev_minor;
uint32_t dev_major;
uint32_t dev_minor;
uint64_t __spare2[14];
};
struct fuse_kstatfs {
uint64_t blocks;
uint64_t bfree;
uint64_t bavail;
uint64_t files;
uint64_t ffree;
uint32_t bsize;
uint32_t namelen;
uint32_t frsize;
uint32_t padding;
uint32_t spare[6];
};
struct fuse_file_lock {
uint64_t start;
uint64_t end;
uint32_t type;
uint32_t pid; /* tgid */
};
/**
* Bitmasks for fuse_setattr_in.valid
*/
#define FATTR_MODE (1 << 0)
#define FATTR_UID (1 << 1)
#define FATTR_GID (1 << 2)
#define FATTR_SIZE (1 << 3)
#define FATTR_ATIME (1 << 4)
#define FATTR_MTIME (1 << 5)
#define FATTR_FH (1 << 6)
#define FATTR_ATIME_NOW (1 << 7)
#define FATTR_MTIME_NOW (1 << 8)
#define FATTR_LOCKOWNER (1 << 9)
#define FATTR_CTIME (1 << 10)
#define FATTR_KILL_SUIDGID (1 << 11)
/**
* Flags returned by the OPEN request
*
* FOPEN_DIRECT_IO: bypass page cache for this open file
* FOPEN_KEEP_CACHE: don't invalidate the data cache on open
* FOPEN_NONSEEKABLE: the file is not seekable
* FOPEN_CACHE_DIR: allow caching this directory
fuse: Add FOPEN_STREAM to use stream_open() Starting from commit 9c225f2655e3 ("vfs: atomic f_pos accesses as per POSIX") files opened even via nonseekable_open gate read and write via lock and do not allow them to be run simultaneously. This can create read vs write deadlock if a filesystem is trying to implement a socket-like file which is intended to be simultaneously used for both read and write from filesystem client. See commit 10dce8af3422 ("fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock") for details and e.g. commit 581d21a2d02a ("xenbus: fix deadlock on writes to /proc/xen/xenbus") for a similar deadlock example on /proc/xen/xenbus. To avoid such deadlock it was tempting to adjust fuse_finish_open to use stream_open instead of nonseekable_open on just FOPEN_NONSEEKABLE flags, but grepping through Debian codesearch shows users of FOPEN_NONSEEKABLE, and in particular GVFS which actually uses offset in its read and write handlers https://codesearch.debian.net/search?q=-%3Enonseekable+%3D https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1080 https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1247-1346 https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1399-1481 so if we would do such a change it will break a real user. Add another flag (FOPEN_STREAM) for filesystem servers to indicate that the opened handler is having stream-like semantics; does not use file position and thus the kernel is free to issue simultaneous read and write request on opened file handle. This patch together with stream_open() should be added to stable kernels starting from v3.14+. This will allow to patch OSSPD and other FUSE filesystems that provide stream-like files to return FOPEN_STREAM | FOPEN_NONSEEKABLE in open handler and this way avoid the deadlock on all kernel versions. This should work because fuse_finish_open ignores unknown open flags returned from a filesystem and so passing FOPEN_STREAM to a kernel that is not aware of this flag cannot hurt. In turn the kernel that is not aware of FOPEN_STREAM will be < v3.14 where just FOPEN_NONSEEKABLE is sufficient to implement streams without read vs write deadlock. Cc: stable@vger.kernel.org # v3.14+ Signed-off-by: Kirill Smelkov <kirr@nexedi.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-04-24 07:13:57 +00:00
* FOPEN_STREAM: the file is stream-like (no file position at all)
* FOPEN_NOFLUSH: don't flush data cache on close (unless FUSE_WRITEBACK_CACHE)
fuse: allow non-extending parallel direct writes on the same file In general, as of now, in FUSE, direct writes on the same file are serialized over inode lock i.e we hold inode lock for the full duration of the write request. I could not find in fuse code and git history a comment which clearly explains why this exclusive lock is taken for direct writes. Following might be the reasons for acquiring an exclusive lock but not be limited to 1) Our guess is some USER space fuse implementations might be relying on this lock for serialization. 2) The lock protects against file read/write size races. 3) Ruling out any issues arising from partial write failures. This patch relaxes the exclusive lock for direct non-extending writes only. File size extending writes might not need the lock either, but we are not entirely sure if there is a risk to introduce any kind of regression. Furthermore, benchmarking with fio does not show a difference between patch versions that take on file size extension a) an exclusive lock and b) a shared lock. A possible example of an issue with i_size extending writes are write error cases. Some writes might succeed and others might fail for file system internal reasons - for example ENOSPACE. With parallel file size extending writes it _might_ be difficult to revert the action of the failing write, especially to restore the right i_size. With these changes, we allow non-extending parallel direct writes on the same file with the help of a flag called FOPEN_PARALLEL_DIRECT_WRITES. If this flag is set on the file (flag is passed from libfuse to fuse kernel as part of file open/create), we do not take exclusive lock anymore, but instead use a shared lock that allows non-extending writes to run in parallel. FUSE implementations which rely on this inode lock for serialization can continue to do so and serialized direct writes are still the default. Implementations that do not do write serialization need to be updated and need to set the FOPEN_PARALLEL_DIRECT_WRITES flag in their file open/create reply. On patch review there were concerns that network file systems (or vfs multiple mounts of the same file system) might have issues with parallel writes. We believe this is not the case, as this is just a local lock, which network file systems could not rely on anyway. I.e. this lock is just for local consistency. Signed-off-by: Dharmendra Singh <dsingh@ddn.com> Signed-off-by: Bernd Schubert <bschubert@ddn.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-06-17 07:10:27 +00:00
* FOPEN_PARALLEL_DIRECT_WRITES: Allow concurrent direct writes on the same inode
* FOPEN_PASSTHROUGH: passthrough read/write io for this open file
*/
#define FOPEN_DIRECT_IO (1 << 0)
#define FOPEN_KEEP_CACHE (1 << 1)
#define FOPEN_NONSEEKABLE (1 << 2)
#define FOPEN_CACHE_DIR (1 << 3)
fuse: Add FOPEN_STREAM to use stream_open() Starting from commit 9c225f2655e3 ("vfs: atomic f_pos accesses as per POSIX") files opened even via nonseekable_open gate read and write via lock and do not allow them to be run simultaneously. This can create read vs write deadlock if a filesystem is trying to implement a socket-like file which is intended to be simultaneously used for both read and write from filesystem client. See commit 10dce8af3422 ("fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock") for details and e.g. commit 581d21a2d02a ("xenbus: fix deadlock on writes to /proc/xen/xenbus") for a similar deadlock example on /proc/xen/xenbus. To avoid such deadlock it was tempting to adjust fuse_finish_open to use stream_open instead of nonseekable_open on just FOPEN_NONSEEKABLE flags, but grepping through Debian codesearch shows users of FOPEN_NONSEEKABLE, and in particular GVFS which actually uses offset in its read and write handlers https://codesearch.debian.net/search?q=-%3Enonseekable+%3D https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1080 https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1247-1346 https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1399-1481 so if we would do such a change it will break a real user. Add another flag (FOPEN_STREAM) for filesystem servers to indicate that the opened handler is having stream-like semantics; does not use file position and thus the kernel is free to issue simultaneous read and write request on opened file handle. This patch together with stream_open() should be added to stable kernels starting from v3.14+. This will allow to patch OSSPD and other FUSE filesystems that provide stream-like files to return FOPEN_STREAM | FOPEN_NONSEEKABLE in open handler and this way avoid the deadlock on all kernel versions. This should work because fuse_finish_open ignores unknown open flags returned from a filesystem and so passing FOPEN_STREAM to a kernel that is not aware of this flag cannot hurt. In turn the kernel that is not aware of FOPEN_STREAM will be < v3.14 where just FOPEN_NONSEEKABLE is sufficient to implement streams without read vs write deadlock. Cc: stable@vger.kernel.org # v3.14+ Signed-off-by: Kirill Smelkov <kirr@nexedi.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-04-24 07:13:57 +00:00
#define FOPEN_STREAM (1 << 4)
#define FOPEN_NOFLUSH (1 << 5)
fuse: allow non-extending parallel direct writes on the same file In general, as of now, in FUSE, direct writes on the same file are serialized over inode lock i.e we hold inode lock for the full duration of the write request. I could not find in fuse code and git history a comment which clearly explains why this exclusive lock is taken for direct writes. Following might be the reasons for acquiring an exclusive lock but not be limited to 1) Our guess is some USER space fuse implementations might be relying on this lock for serialization. 2) The lock protects against file read/write size races. 3) Ruling out any issues arising from partial write failures. This patch relaxes the exclusive lock for direct non-extending writes only. File size extending writes might not need the lock either, but we are not entirely sure if there is a risk to introduce any kind of regression. Furthermore, benchmarking with fio does not show a difference between patch versions that take on file size extension a) an exclusive lock and b) a shared lock. A possible example of an issue with i_size extending writes are write error cases. Some writes might succeed and others might fail for file system internal reasons - for example ENOSPACE. With parallel file size extending writes it _might_ be difficult to revert the action of the failing write, especially to restore the right i_size. With these changes, we allow non-extending parallel direct writes on the same file with the help of a flag called FOPEN_PARALLEL_DIRECT_WRITES. If this flag is set on the file (flag is passed from libfuse to fuse kernel as part of file open/create), we do not take exclusive lock anymore, but instead use a shared lock that allows non-extending writes to run in parallel. FUSE implementations which rely on this inode lock for serialization can continue to do so and serialized direct writes are still the default. Implementations that do not do write serialization need to be updated and need to set the FOPEN_PARALLEL_DIRECT_WRITES flag in their file open/create reply. On patch review there were concerns that network file systems (or vfs multiple mounts of the same file system) might have issues with parallel writes. We believe this is not the case, as this is just a local lock, which network file systems could not rely on anyway. I.e. this lock is just for local consistency. Signed-off-by: Dharmendra Singh <dsingh@ddn.com> Signed-off-by: Bernd Schubert <bschubert@ddn.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-06-17 07:10:27 +00:00
#define FOPEN_PARALLEL_DIRECT_WRITES (1 << 6)
#define FOPEN_PASSTHROUGH (1 << 7)
/**
* INIT request/reply flags
*
* FUSE_ASYNC_READ: asynchronous read requests
* FUSE_POSIX_LOCKS: remote locking for POSIX file locks
* FUSE_FILE_OPS: kernel sends file handle for fstat, etc... (not yet supported)
* FUSE_ATOMIC_O_TRUNC: handles the O_TRUNC open flag in the filesystem
* FUSE_EXPORT_SUPPORT: filesystem handles lookups of "." and ".."
* FUSE_BIG_WRITES: filesystem can handle write size larger than 4kB
* FUSE_DONT_MASK: don't apply umask to file mode on create operations
* FUSE_SPLICE_WRITE: kernel supports splice write on the device
* FUSE_SPLICE_MOVE: kernel supports splice move on the device
* FUSE_SPLICE_READ: kernel supports splice read on the device
* FUSE_FLOCK_LOCKS: remote locking for BSD style file locks
* FUSE_HAS_IOCTL_DIR: kernel supports ioctl on directories
* FUSE_AUTO_INVAL_DATA: automatically invalidate cached pages
* FUSE_DO_READDIRPLUS: do READDIRPLUS (READDIR+LOOKUP in one)
* FUSE_READDIRPLUS_AUTO: adaptive readdirplus
* FUSE_ASYNC_DIO: asynchronous direct I/O submission
* FUSE_WRITEBACK_CACHE: use writeback cache for buffered writes
* FUSE_NO_OPEN_SUPPORT: kernel supports zero-message opens
* FUSE_PARALLEL_DIROPS: allow parallel lookups and readdir
* FUSE_HANDLE_KILLPRIV: fs handles killing suid/sgid/cap on write/chown/trunc
* FUSE_POSIX_ACL: filesystem supports posix acls
* FUSE_ABORT_ERROR: reading the device after abort returns ECONNABORTED
fuse: add max_pages to init_out Replace FUSE_MAX_PAGES_PER_REQ with the configurable parameter max_pages to improve performance. Old RFC with detailed description of the problem and many fixes by Mitsuo Hayasaka (mitsuo.hayasaka.hu@hitachi.com): - https://lkml.org/lkml/2012/7/5/136 We've encountered performance degradation and fixed it on a big and complex virtual environment. Environment to reproduce degradation and improvement: 1. Add lag to user mode FUSE Add nanosleep(&(struct timespec){ 0, 1000 }, NULL); to xmp_write_buf in passthrough_fh.c 2. patch UM fuse with configurable max_pages parameter. The patch will be provided latter. 3. run test script and perform test on tmpfs fuse_test() { cd /tmp mkdir -p fusemnt passthrough_fh -o max_pages=$1 /tmp/fusemnt grep fuse /proc/self/mounts dd conv=fdatasync oflag=dsync if=/dev/zero of=fusemnt/tmp/tmp \ count=1K bs=1M 2>&1 | grep -v records rm fusemnt/tmp/tmp killall passthrough_fh } Test results: passthrough_fh /tmp/fusemnt fuse.passthrough_fh \ rw,nosuid,nodev,relatime,user_id=0,group_id=0 0 0 1073741824 bytes (1.1 GB) copied, 1.73867 s, 618 MB/s passthrough_fh /tmp/fusemnt fuse.passthrough_fh \ rw,nosuid,nodev,relatime,user_id=0,group_id=0,max_pages=256 0 0 1073741824 bytes (1.1 GB) copied, 1.15643 s, 928 MB/s Obviously with bigger lag the difference between 'before' and 'after' will be more significant. Mitsuo Hayasaka, in 2012 (https://lkml.org/lkml/2012/7/5/136), observed improvement from 400-550 to 520-740. Signed-off-by: Constantine Shulyupin <const@MakeLinux.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-09-06 12:37:06 +00:00
* FUSE_MAX_PAGES: init_out.max_pages contains the max number of req pages
* FUSE_CACHE_SYMLINKS: cache READLINK responses
* FUSE_NO_OPENDIR_SUPPORT: kernel supports zero-message opendir
fuse: allow filesystems to have precise control over data cache On networked filesystems file data can be changed externally. FUSE provides notification messages for filesystem to inform kernel that metadata or data region of a file needs to be invalidated in local page cache. That provides the basis for filesystem implementations to invalidate kernel cache explicitly based on observed filesystem-specific events. FUSE has also "automatic" invalidation mode(*) when the kernel automatically invalidates data cache of a file if it sees mtime change. It also automatically invalidates whole data cache of a file if it sees file size being changed. The automatic mode has corresponding capability - FUSE_AUTO_INVAL_DATA. However, due to probably historical reason, that capability controls only whether mtime change should be resulting in automatic invalidation or not. A change in file size always results in invalidating whole data cache of a file irregardless of whether FUSE_AUTO_INVAL_DATA was negotiated(+). The filesystem I write[1] represents data arrays stored in networked database as local files suitable for mmap. It is read-only filesystem - changes to data are committed externally via database interfaces and the filesystem only glues data into contiguous file streams suitable for mmap and traditional array processing. The files are big - starting from hundreds gigabytes and more. The files change regularly, and frequently by data being appended to their end. The size of files thus changes frequently. If a file was accessed locally and some part of its data got into page cache, we want that data to stay cached unless there is memory pressure, or unless corresponding part of the file was actually changed. However current FUSE behaviour - when it sees file size change - is to invalidate the whole file. The data cache of the file is thus completely lost even on small size change, and despite that the filesystem server is careful to accurately translate database changes into FUSE invalidation messages to kernel. Let's fix it: if a filesystem, through new FUSE_EXPLICIT_INVAL_DATA capability, indicates to kernel that it is fully responsible for data cache invalidation, then the kernel won't invalidate files data cache on size change and only truncate that cache to new size in case the size decreased. (*) see 72d0d248ca "fuse: add FUSE_AUTO_INVAL_DATA init flag", eed2179efe "fuse: invalidate inode mapping if mtime changes" (+) in writeback mode the kernel does not invalidate data cache on file size change, but neither it allows the filesystem to set the size due to external event (see 8373200b12 "fuse: Trust kernel i_size only") [1] https://lab.nexedi.com/kirr/wendelin.core/blob/a50f1d9f/wcfs/wcfs.go#L20 Signed-off-by: Kirill Smelkov <kirr@nexedi.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-03-27 11:14:15 +00:00
* FUSE_EXPLICIT_INVAL_DATA: only invalidate cached pages on explicit request
* FUSE_MAP_ALIGNMENT: init_out.map_alignment contains log2(byte alignment) for
* foffset and moffset fields in struct
* fuse_setupmapping_out and fuse_removemapping_one.
* FUSE_SUBMOUNTS: kernel supports auto-mounting directory submounts
* FUSE_HANDLE_KILLPRIV_V2: fs kills suid/sgid/cap on write/chown/trunc.
* Upon write/truncate suid/sgid is only killed if caller
* does not have CAP_FSETID. Additionally upon
* write/truncate sgid is killed only if file has group
* execute permission. (Same as Linux VFS behavior).
* FUSE_SETXATTR_EXT: Server supports extended struct fuse_setxattr_in
* FUSE_INIT_EXT: extended fuse_init_in request
* FUSE_INIT_RESERVED: reserved, do not use
fuse: send security context of inode on file When a new inode is created, send its security context to server along with creation request (FUSE_CREAT, FUSE_MKNOD, FUSE_MKDIR and FUSE_SYMLINK). This gives server an opportunity to create new file and set security context (possibly atomically). In all the configurations it might not be possible to set context atomically. Like nfs and ceph, use security_dentry_init_security() to dermine security context of inode and send it with create, mkdir, mknod, and symlink requests. Following is the information sent to server. fuse_sectx_header, fuse_secctx, xattr_name, security_context - struct fuse_secctx_header This contains total number of security contexts being sent and total size of all the security contexts (including size of fuse_secctx_header). - struct fuse_secctx This contains size of security context which follows this structure. There is one fuse_secctx instance per security context. - xattr name string This string represents name of xattr which should be used while setting security context. - security context This is the actual security context whose size is specified in fuse_secctx struct. Also add the FUSE_SECURITY_CTX flag for the `flags` field of the fuse_init_out struct. When this flag is set the kernel will append the security context for a newly created inode to the request (create, mkdir, mknod, and symlink). The server is responsible for ensuring that the inode appears atomically (preferrably) with the requested security context. For example, If the server is using SELinux and backed by a "real" linux file system that supports extended attributes it can write the security context value to /proc/thread-self/attr/fscreate before making the syscall to create the inode. This patch is based on patch from Chirantan Ekbote <chirantan@chromium.org> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-11-11 14:32:49 +00:00
* FUSE_SECURITY_CTX: add security context to create, mkdir, symlink, and
* mknod
* FUSE_HAS_INODE_DAX: use per inode DAX
* FUSE_CREATE_SUPP_GROUP: add supplementary group info to create, mkdir,
* symlink and mknod (single group that matches parent)
* FUSE_HAS_EXPIRE_ONLY: kernel supports expiry-only entry invalidation
* FUSE_DIRECT_IO_ALLOW_MMAP: allow shared mmap in FOPEN_DIRECT_IO mode.
fuse: add support for explicit export disabling open_by_handle_at(2) can fail with -ESTALE with a valid handle returned by a previous name_to_handle_at(2) for evicted fuse inodes, which is especially common when entry_valid_timeout is 0, e.g. when the fuse daemon is in "cache=none" mode. The time sequence is like: name_to_handle_at(2) # succeed evict fuse inode open_by_handle_at(2) # fail The root cause is that, with 0 entry_valid_timeout, the dput() called in name_to_handle_at(2) will trigger iput -> evict(), which will send FUSE_FORGET to the daemon. The following open_by_handle_at(2) will send a new FUSE_LOOKUP request upon inode cache miss since the previous inode eviction. Then the fuse daemon may fail the FUSE_LOOKUP request with -ENOENT as the cached metadata of the requested inode has already been cleaned up during the previous FUSE_FORGET. The returned -ENOENT is treated as -ESTALE when open_by_handle_at(2) returns. This confuses the application somehow, as open_by_handle_at(2) fails when the previous name_to_handle_at(2) succeeds. The returned errno is also confusing as the requested file is not deleted and already there. It is reasonable to fail name_to_handle_at(2) early in this case, after which the application can fallback to open(2) to access files. Since this issue typically appears when entry_valid_timeout is 0 which is configured by the fuse daemon, the fuse daemon is the right person to explicitly disable the export when required. Also considering FUSE_EXPORT_SUPPORT actually indicates the support for lookups of "." and "..", and there are existing fuse daemons supporting export without FUSE_EXPORT_SUPPORT set, for compatibility, we add a new INIT flag for such purpose. Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-02-26 03:54:35 +00:00
* FUSE_NO_EXPORT_SUPPORT: explicitly disable export support
* FUSE_HAS_RESEND: kernel supports resending pending requests, and the high bit
* of the request ID indicates resend requests
*/
#define FUSE_ASYNC_READ (1 << 0)
#define FUSE_POSIX_LOCKS (1 << 1)
#define FUSE_FILE_OPS (1 << 2)
#define FUSE_ATOMIC_O_TRUNC (1 << 3)
#define FUSE_EXPORT_SUPPORT (1 << 4)
#define FUSE_BIG_WRITES (1 << 5)
#define FUSE_DONT_MASK (1 << 6)
#define FUSE_SPLICE_WRITE (1 << 7)
#define FUSE_SPLICE_MOVE (1 << 8)
#define FUSE_SPLICE_READ (1 << 9)
#define FUSE_FLOCK_LOCKS (1 << 10)
#define FUSE_HAS_IOCTL_DIR (1 << 11)
#define FUSE_AUTO_INVAL_DATA (1 << 12)
#define FUSE_DO_READDIRPLUS (1 << 13)
#define FUSE_READDIRPLUS_AUTO (1 << 14)
#define FUSE_ASYNC_DIO (1 << 15)
#define FUSE_WRITEBACK_CACHE (1 << 16)
#define FUSE_NO_OPEN_SUPPORT (1 << 17)
#define FUSE_PARALLEL_DIROPS (1 << 18)
#define FUSE_HANDLE_KILLPRIV (1 << 19)
#define FUSE_POSIX_ACL (1 << 20)
#define FUSE_ABORT_ERROR (1 << 21)
fuse: add max_pages to init_out Replace FUSE_MAX_PAGES_PER_REQ with the configurable parameter max_pages to improve performance. Old RFC with detailed description of the problem and many fixes by Mitsuo Hayasaka (mitsuo.hayasaka.hu@hitachi.com): - https://lkml.org/lkml/2012/7/5/136 We've encountered performance degradation and fixed it on a big and complex virtual environment. Environment to reproduce degradation and improvement: 1. Add lag to user mode FUSE Add nanosleep(&(struct timespec){ 0, 1000 }, NULL); to xmp_write_buf in passthrough_fh.c 2. patch UM fuse with configurable max_pages parameter. The patch will be provided latter. 3. run test script and perform test on tmpfs fuse_test() { cd /tmp mkdir -p fusemnt passthrough_fh -o max_pages=$1 /tmp/fusemnt grep fuse /proc/self/mounts dd conv=fdatasync oflag=dsync if=/dev/zero of=fusemnt/tmp/tmp \ count=1K bs=1M 2>&1 | grep -v records rm fusemnt/tmp/tmp killall passthrough_fh } Test results: passthrough_fh /tmp/fusemnt fuse.passthrough_fh \ rw,nosuid,nodev,relatime,user_id=0,group_id=0 0 0 1073741824 bytes (1.1 GB) copied, 1.73867 s, 618 MB/s passthrough_fh /tmp/fusemnt fuse.passthrough_fh \ rw,nosuid,nodev,relatime,user_id=0,group_id=0,max_pages=256 0 0 1073741824 bytes (1.1 GB) copied, 1.15643 s, 928 MB/s Obviously with bigger lag the difference between 'before' and 'after' will be more significant. Mitsuo Hayasaka, in 2012 (https://lkml.org/lkml/2012/7/5/136), observed improvement from 400-550 to 520-740. Signed-off-by: Constantine Shulyupin <const@MakeLinux.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-09-06 12:37:06 +00:00
#define FUSE_MAX_PAGES (1 << 22)
#define FUSE_CACHE_SYMLINKS (1 << 23)
#define FUSE_NO_OPENDIR_SUPPORT (1 << 24)
fuse: allow filesystems to have precise control over data cache On networked filesystems file data can be changed externally. FUSE provides notification messages for filesystem to inform kernel that metadata or data region of a file needs to be invalidated in local page cache. That provides the basis for filesystem implementations to invalidate kernel cache explicitly based on observed filesystem-specific events. FUSE has also "automatic" invalidation mode(*) when the kernel automatically invalidates data cache of a file if it sees mtime change. It also automatically invalidates whole data cache of a file if it sees file size being changed. The automatic mode has corresponding capability - FUSE_AUTO_INVAL_DATA. However, due to probably historical reason, that capability controls only whether mtime change should be resulting in automatic invalidation or not. A change in file size always results in invalidating whole data cache of a file irregardless of whether FUSE_AUTO_INVAL_DATA was negotiated(+). The filesystem I write[1] represents data arrays stored in networked database as local files suitable for mmap. It is read-only filesystem - changes to data are committed externally via database interfaces and the filesystem only glues data into contiguous file streams suitable for mmap and traditional array processing. The files are big - starting from hundreds gigabytes and more. The files change regularly, and frequently by data being appended to their end. The size of files thus changes frequently. If a file was accessed locally and some part of its data got into page cache, we want that data to stay cached unless there is memory pressure, or unless corresponding part of the file was actually changed. However current FUSE behaviour - when it sees file size change - is to invalidate the whole file. The data cache of the file is thus completely lost even on small size change, and despite that the filesystem server is careful to accurately translate database changes into FUSE invalidation messages to kernel. Let's fix it: if a filesystem, through new FUSE_EXPLICIT_INVAL_DATA capability, indicates to kernel that it is fully responsible for data cache invalidation, then the kernel won't invalidate files data cache on size change and only truncate that cache to new size in case the size decreased. (*) see 72d0d248ca "fuse: add FUSE_AUTO_INVAL_DATA init flag", eed2179efe "fuse: invalidate inode mapping if mtime changes" (+) in writeback mode the kernel does not invalidate data cache on file size change, but neither it allows the filesystem to set the size due to external event (see 8373200b12 "fuse: Trust kernel i_size only") [1] https://lab.nexedi.com/kirr/wendelin.core/blob/a50f1d9f/wcfs/wcfs.go#L20 Signed-off-by: Kirill Smelkov <kirr@nexedi.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-03-27 11:14:15 +00:00
#define FUSE_EXPLICIT_INVAL_DATA (1 << 25)
#define FUSE_MAP_ALIGNMENT (1 << 26)
#define FUSE_SUBMOUNTS (1 << 27)
#define FUSE_HANDLE_KILLPRIV_V2 (1 << 28)
#define FUSE_SETXATTR_EXT (1 << 29)
#define FUSE_INIT_EXT (1 << 30)
#define FUSE_INIT_RESERVED (1 << 31)
/* bits 32..63 get shifted down 32 bits into the flags2 field */
fuse: send security context of inode on file When a new inode is created, send its security context to server along with creation request (FUSE_CREAT, FUSE_MKNOD, FUSE_MKDIR and FUSE_SYMLINK). This gives server an opportunity to create new file and set security context (possibly atomically). In all the configurations it might not be possible to set context atomically. Like nfs and ceph, use security_dentry_init_security() to dermine security context of inode and send it with create, mkdir, mknod, and symlink requests. Following is the information sent to server. fuse_sectx_header, fuse_secctx, xattr_name, security_context - struct fuse_secctx_header This contains total number of security contexts being sent and total size of all the security contexts (including size of fuse_secctx_header). - struct fuse_secctx This contains size of security context which follows this structure. There is one fuse_secctx instance per security context. - xattr name string This string represents name of xattr which should be used while setting security context. - security context This is the actual security context whose size is specified in fuse_secctx struct. Also add the FUSE_SECURITY_CTX flag for the `flags` field of the fuse_init_out struct. When this flag is set the kernel will append the security context for a newly created inode to the request (create, mkdir, mknod, and symlink). The server is responsible for ensuring that the inode appears atomically (preferrably) with the requested security context. For example, If the server is using SELinux and backed by a "real" linux file system that supports extended attributes it can write the security context value to /proc/thread-self/attr/fscreate before making the syscall to create the inode. This patch is based on patch from Chirantan Ekbote <chirantan@chromium.org> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-11-11 14:32:49 +00:00
#define FUSE_SECURITY_CTX (1ULL << 32)
#define FUSE_HAS_INODE_DAX (1ULL << 33)
#define FUSE_CREATE_SUPP_GROUP (1ULL << 34)
#define FUSE_HAS_EXPIRE_ONLY (1ULL << 35)
#define FUSE_DIRECT_IO_ALLOW_MMAP (1ULL << 36)
#define FUSE_PASSTHROUGH (1ULL << 37)
fuse: add support for explicit export disabling open_by_handle_at(2) can fail with -ESTALE with a valid handle returned by a previous name_to_handle_at(2) for evicted fuse inodes, which is especially common when entry_valid_timeout is 0, e.g. when the fuse daemon is in "cache=none" mode. The time sequence is like: name_to_handle_at(2) # succeed evict fuse inode open_by_handle_at(2) # fail The root cause is that, with 0 entry_valid_timeout, the dput() called in name_to_handle_at(2) will trigger iput -> evict(), which will send FUSE_FORGET to the daemon. The following open_by_handle_at(2) will send a new FUSE_LOOKUP request upon inode cache miss since the previous inode eviction. Then the fuse daemon may fail the FUSE_LOOKUP request with -ENOENT as the cached metadata of the requested inode has already been cleaned up during the previous FUSE_FORGET. The returned -ENOENT is treated as -ESTALE when open_by_handle_at(2) returns. This confuses the application somehow, as open_by_handle_at(2) fails when the previous name_to_handle_at(2) succeeds. The returned errno is also confusing as the requested file is not deleted and already there. It is reasonable to fail name_to_handle_at(2) early in this case, after which the application can fallback to open(2) to access files. Since this issue typically appears when entry_valid_timeout is 0 which is configured by the fuse daemon, the fuse daemon is the right person to explicitly disable the export when required. Also considering FUSE_EXPORT_SUPPORT actually indicates the support for lookups of "." and "..", and there are existing fuse daemons supporting export without FUSE_EXPORT_SUPPORT set, for compatibility, we add a new INIT flag for such purpose. Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-02-26 03:54:35 +00:00
#define FUSE_NO_EXPORT_SUPPORT (1ULL << 38)
#define FUSE_HAS_RESEND (1ULL << 39)
/* Obsolete alias for FUSE_DIRECT_IO_ALLOW_MMAP */
#define FUSE_DIRECT_IO_RELAX FUSE_DIRECT_IO_ALLOW_MMAP
/**
* CUSE INIT request/reply flags
*
* CUSE_UNRESTRICTED_IOCTL: use unrestricted ioctl
*/
#define CUSE_UNRESTRICTED_IOCTL (1 << 0)
/**
* Release flags
*/
#define FUSE_RELEASE_FLUSH (1 << 0)
#define FUSE_RELEASE_FLOCK_UNLOCK (1 << 1)
/**
* Getattr flags
*/
#define FUSE_GETATTR_FH (1 << 0)
/**
* Lock flags
*/
#define FUSE_LK_FLOCK (1 << 0)
/**
* WRITE flags
*
* FUSE_WRITE_CACHE: delayed write from page cache, file handle is guessed
* FUSE_WRITE_LOCKOWNER: lock_owner field is valid
* FUSE_WRITE_KILL_SUIDGID: kill suid and sgid bits
*/
#define FUSE_WRITE_CACHE (1 << 0)
#define FUSE_WRITE_LOCKOWNER (1 << 1)
#define FUSE_WRITE_KILL_SUIDGID (1 << 2)
/* Obsolete alias; this flag implies killing suid/sgid only. */
#define FUSE_WRITE_KILL_PRIV FUSE_WRITE_KILL_SUIDGID
/**
* Read flags
*/
#define FUSE_READ_LOCKOWNER (1 << 1)
/**
* Ioctl flags
*
* FUSE_IOCTL_COMPAT: 32bit compat ioctl on 64bit machine
* FUSE_IOCTL_UNRESTRICTED: not restricted to well-formed ioctls, retry allowed
* FUSE_IOCTL_RETRY: retry with new iovecs
* FUSE_IOCTL_32BIT: 32bit ioctl
* FUSE_IOCTL_DIR: is a directory
* FUSE_IOCTL_COMPAT_X32: x32 compat ioctl on 64bit machine (64bit time_t)
*
* FUSE_IOCTL_MAX_IOV: maximum of in_iovecs + out_iovecs
*/
#define FUSE_IOCTL_COMPAT (1 << 0)
#define FUSE_IOCTL_UNRESTRICTED (1 << 1)
#define FUSE_IOCTL_RETRY (1 << 2)
#define FUSE_IOCTL_32BIT (1 << 3)
#define FUSE_IOCTL_DIR (1 << 4)
#define FUSE_IOCTL_COMPAT_X32 (1 << 5)
#define FUSE_IOCTL_MAX_IOV 256
/**
* Poll flags
*
* FUSE_POLL_SCHEDULE_NOTIFY: request poll notify
*/
#define FUSE_POLL_SCHEDULE_NOTIFY (1 << 0)
/**
* Fsync flags
*
* FUSE_FSYNC_FDATASYNC: Sync data only, not metadata
*/
#define FUSE_FSYNC_FDATASYNC (1 << 0)
/**
* fuse_attr flags
*
* FUSE_ATTR_SUBMOUNT: Object is a submount root
* FUSE_ATTR_DAX: Enable DAX for this file in per inode DAX mode
*/
#define FUSE_ATTR_SUBMOUNT (1 << 0)
#define FUSE_ATTR_DAX (1 << 1)
/**
* Open flags
* FUSE_OPEN_KILL_SUIDGID: Kill suid and sgid if executable
*/
#define FUSE_OPEN_KILL_SUIDGID (1 << 0)
/**
* setxattr flags
* FUSE_SETXATTR_ACL_KILL_SGID: Clear SGID when system.posix_acl_access is set
*/
#define FUSE_SETXATTR_ACL_KILL_SGID (1 << 0)
/**
* notify_inval_entry flags
* FUSE_EXPIRE_ONLY
*/
#define FUSE_EXPIRE_ONLY (1 << 0)
/**
* extension type
* FUSE_MAX_NR_SECCTX: maximum value of &fuse_secctx_header.nr_secctx
* FUSE_EXT_GROUPS: &fuse_supp_groups extension
*/
enum fuse_ext_type {
/* Types 0..31 are reserved for fuse_secctx_header */
FUSE_MAX_NR_SECCTX = 31,
FUSE_EXT_GROUPS = 32,
};
enum fuse_opcode {
FUSE_LOOKUP = 1,
FUSE_FORGET = 2, /* no reply */
FUSE_GETATTR = 3,
FUSE_SETATTR = 4,
FUSE_READLINK = 5,
FUSE_SYMLINK = 6,
FUSE_MKNOD = 8,
FUSE_MKDIR = 9,
FUSE_UNLINK = 10,
FUSE_RMDIR = 11,
FUSE_RENAME = 12,
FUSE_LINK = 13,
FUSE_OPEN = 14,
FUSE_READ = 15,
FUSE_WRITE = 16,
FUSE_STATFS = 17,
FUSE_RELEASE = 18,
FUSE_FSYNC = 20,
FUSE_SETXATTR = 21,
FUSE_GETXATTR = 22,
FUSE_LISTXATTR = 23,
FUSE_REMOVEXATTR = 24,
FUSE_FLUSH = 25,
FUSE_INIT = 26,
FUSE_OPENDIR = 27,
FUSE_READDIR = 28,
FUSE_RELEASEDIR = 29,
FUSE_FSYNCDIR = 30,
FUSE_GETLK = 31,
FUSE_SETLK = 32,
FUSE_SETLKW = 33,
FUSE_ACCESS = 34,
FUSE_CREATE = 35,
FUSE_INTERRUPT = 36,
FUSE_BMAP = 37,
FUSE_DESTROY = 38,
FUSE_IOCTL = 39,
FUSE_POLL = 40,
FUSE_NOTIFY_REPLY = 41,
FUSE_BATCH_FORGET = 42,
FUSE_FALLOCATE = 43,
FUSE_READDIRPLUS = 44,
FUSE_RENAME2 = 45,
FUSE_LSEEK = 46,
FUSE_COPY_FILE_RANGE = 47,
FUSE_SETUPMAPPING = 48,
FUSE_REMOVEMAPPING = 49,
virtiofs: propagate sync() to file server Even if POSIX doesn't mandate it, linux users legitimately expect sync() to flush all data and metadata to physical storage when it is located on the same system. This isn't happening with virtiofs though: sync() inside the guest returns right away even though data still needs to be flushed from the host page cache. This is easily demonstrated by doing the following in the guest: $ dd if=/dev/zero of=/mnt/foo bs=1M count=5K ; strace -T -e sync sync 5120+0 records in 5120+0 records out 5368709120 bytes (5.4 GB, 5.0 GiB) copied, 5.22224 s, 1.0 GB/s sync() = 0 <0.024068> and start the following in the host when the 'dd' command completes in the guest: $ strace -T -e fsync /usr/bin/sync virtiofs/foo fsync(3) = 0 <10.371640> There are no good reasons not to honor the expected behavior of sync() actually: it gives an unrealistic impression that virtiofs is super fast and that data has safely landed on HW, which isn't the case obviously. Implement a ->sync_fs() superblock operation that sends a new FUSE_SYNCFS request type for this purpose. Provision a 64-bit placeholder for possible future extensions. Since the file server cannot handle the wait == 0 case, we skip it to avoid a gratuitous roundtrip. Note that this is per-superblock: a FUSE_SYNCFS is send for the root mount and for each submount. Like with FUSE_FSYNC and FUSE_FSYNCDIR, lack of support for FUSE_SYNCFS in the file server is treated as permanent success. This ensures compatibility with older file servers: the client will get the current behavior of sync() not being propagated to the file server. Note that such an operation allows the file server to DoS sync(). Since a typical FUSE file server is an untrusted piece of software running in userspace, this is disabled by default. Only enable it with virtiofs for now since virtiofsd is supposedly trusted by the guest kernel. Reported-by: Robert Krawitz <rlk@redhat.com> Signed-off-by: Greg Kurz <groug@kaod.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-05-20 15:46:54 +00:00
FUSE_SYNCFS = 50,
FUSE_TMPFILE = 51,
FUSE_STATX = 52,
/* CUSE specific operations */
CUSE_INIT = 4096,
/* Reserved opcodes: helpful to detect structure endian-ness */
CUSE_INIT_BSWAP_RESERVED = 1048576, /* CUSE_INIT << 8 */
FUSE_INIT_BSWAP_RESERVED = 436207616, /* FUSE_INIT << 24 */
};
enum fuse_notify_code {
FUSE_NOTIFY_POLL = 1,
FUSE_NOTIFY_INVAL_INODE = 2,
FUSE_NOTIFY_INVAL_ENTRY = 3,
FUSE_NOTIFY_STORE = 4,
FUSE_NOTIFY_RETRIEVE = 5,
FUSE: Notifying the kernel of deletion. Allows a FUSE file-system to tell the kernel when a file or directory is deleted. If the specified dentry has the specified inode number, the kernel will unhash it. The current 'fuse_notify_inval_entry' does not cause the kernel to clean up directories that are in use properly, and as a result the users of those directories see incorrect semantics from the file-system. The error condition seen when 'fuse_notify_inval_entry' is used to notify of a deleted directory is avoided when 'fuse_notify_delete' is used instead. The following scenario demonstrates the difference: 1. User A chdirs into 'testdir' and starts reading 'testfile'. 2. User B rm -rf 'testdir'. 3. User B creates 'testdir'. 4. User C chdirs into 'testdir'. If you run the above within the same machine on any file-system (including fuse file-systems), there is no problem: user C is able to chdir into the new testdir. The old testdir is removed from the dentry tree, but still open by user A. If operations 2 and 3 are performed via the network such that the fuse file-system uses one of the notify functions to tell the kernel that the nodes are gone, then the following error occurs for user C while user A holds the original directory open: muirj@empacher:~> ls /test/testdir ls: cannot access /test/testdir: No such file or directory The issue here is that the kernel still has a dentry for testdir, and so it is requesting the attributes for the old directory, while the file-system is responding that the directory no longer exists. If on the other hand, if the file-system can notify the kernel that the directory is deleted using the new 'fuse_notify_delete' function, then the above ls will find the new directory as expected. Signed-off-by: John Muir <john@jmuir.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2011-12-06 20:50:06 +00:00
FUSE_NOTIFY_DELETE = 6,
FUSE_NOTIFY_RESEND = 7,
FUSE_NOTIFY_CODE_MAX,
};
/* The read buffer is required to be at least 8k, but may be much larger */
#define FUSE_MIN_READ_BUFFER 8192
#define FUSE_COMPAT_ENTRY_OUT_SIZE 120
struct fuse_entry_out {
uint64_t nodeid; /* Inode ID */
uint64_t generation; /* Inode generation: nodeid:gen must
be unique for the fs's lifetime */
uint64_t entry_valid; /* Cache timeout for the name */
uint64_t attr_valid; /* Cache timeout for the attributes */
uint32_t entry_valid_nsec;
uint32_t attr_valid_nsec;
struct fuse_attr attr;
};
struct fuse_forget_in {
uint64_t nlookup;
};
struct fuse_forget_one {
uint64_t nodeid;
uint64_t nlookup;
};
struct fuse_batch_forget_in {
uint32_t count;
uint32_t dummy;
};
struct fuse_getattr_in {
uint32_t getattr_flags;
uint32_t dummy;
uint64_t fh;
};
#define FUSE_COMPAT_ATTR_OUT_SIZE 96
struct fuse_attr_out {
uint64_t attr_valid; /* Cache timeout for the attributes */
uint32_t attr_valid_nsec;
uint32_t dummy;
struct fuse_attr attr;
};
struct fuse_statx_in {
uint32_t getattr_flags;
uint32_t reserved;
uint64_t fh;
uint32_t sx_flags;
uint32_t sx_mask;
};
struct fuse_statx_out {
uint64_t attr_valid; /* Cache timeout for the attributes */
uint32_t attr_valid_nsec;
uint32_t flags;
uint64_t spare[2];
struct fuse_statx stat;
};
#define FUSE_COMPAT_MKNOD_IN_SIZE 8
struct fuse_mknod_in {
uint32_t mode;
uint32_t rdev;
uint32_t umask;
uint32_t padding;
};
struct fuse_mkdir_in {
uint32_t mode;
uint32_t umask;
};
struct fuse_rename_in {
uint64_t newdir;
};
struct fuse_rename2_in {
uint64_t newdir;
uint32_t flags;
uint32_t padding;
};
struct fuse_link_in {
uint64_t oldnodeid;
};
struct fuse_setattr_in {
uint32_t valid;
uint32_t padding;
uint64_t fh;
uint64_t size;
uint64_t lock_owner;
uint64_t atime;
uint64_t mtime;
uint64_t ctime;
uint32_t atimensec;
uint32_t mtimensec;
uint32_t ctimensec;
uint32_t mode;
uint32_t unused4;
uint32_t uid;
uint32_t gid;
uint32_t unused5;
};
struct fuse_open_in {
uint32_t flags;
uint32_t open_flags; /* FUSE_OPEN_... */
};
struct fuse_create_in {
uint32_t flags;
uint32_t mode;
uint32_t umask;
uint32_t open_flags; /* FUSE_OPEN_... */
};
struct fuse_open_out {
uint64_t fh;
uint32_t open_flags;
int32_t backing_id;
};
struct fuse_release_in {
uint64_t fh;
uint32_t flags;
uint32_t release_flags;
uint64_t lock_owner;
};
struct fuse_flush_in {
uint64_t fh;
uint32_t unused;
uint32_t padding;
uint64_t lock_owner;
};
struct fuse_read_in {
uint64_t fh;
uint64_t offset;
uint32_t size;
uint32_t read_flags;
uint64_t lock_owner;
uint32_t flags;
uint32_t padding;
};
#define FUSE_COMPAT_WRITE_IN_SIZE 24
struct fuse_write_in {
uint64_t fh;
uint64_t offset;
uint32_t size;
uint32_t write_flags;
uint64_t lock_owner;
uint32_t flags;
uint32_t padding;
};
struct fuse_write_out {
uint32_t size;
uint32_t padding;
};
#define FUSE_COMPAT_STATFS_SIZE 48
struct fuse_statfs_out {
struct fuse_kstatfs st;
};
struct fuse_fsync_in {
uint64_t fh;
uint32_t fsync_flags;
uint32_t padding;
};
#define FUSE_COMPAT_SETXATTR_IN_SIZE 8
struct fuse_setxattr_in {
uint32_t size;
uint32_t flags;
uint32_t setxattr_flags;
uint32_t padding;
};
struct fuse_getxattr_in {
uint32_t size;
uint32_t padding;
};
struct fuse_getxattr_out {
uint32_t size;
uint32_t padding;
};
struct fuse_lk_in {
uint64_t fh;
uint64_t owner;
struct fuse_file_lock lk;
uint32_t lk_flags;
uint32_t padding;
};
struct fuse_lk_out {
struct fuse_file_lock lk;
};
struct fuse_access_in {
uint32_t mask;
uint32_t padding;
};
struct fuse_init_in {
uint32_t major;
uint32_t minor;
uint32_t max_readahead;
uint32_t flags;
uint32_t flags2;
uint32_t unused[11];
};
#define FUSE_COMPAT_INIT_OUT_SIZE 8
#define FUSE_COMPAT_22_INIT_OUT_SIZE 24
struct fuse_init_out {
uint32_t major;
uint32_t minor;
uint32_t max_readahead;
uint32_t flags;
uint16_t max_background;
uint16_t congestion_threshold;
uint32_t max_write;
uint32_t time_gran;
fuse: add max_pages to init_out Replace FUSE_MAX_PAGES_PER_REQ with the configurable parameter max_pages to improve performance. Old RFC with detailed description of the problem and many fixes by Mitsuo Hayasaka (mitsuo.hayasaka.hu@hitachi.com): - https://lkml.org/lkml/2012/7/5/136 We've encountered performance degradation and fixed it on a big and complex virtual environment. Environment to reproduce degradation and improvement: 1. Add lag to user mode FUSE Add nanosleep(&(struct timespec){ 0, 1000 }, NULL); to xmp_write_buf in passthrough_fh.c 2. patch UM fuse with configurable max_pages parameter. The patch will be provided latter. 3. run test script and perform test on tmpfs fuse_test() { cd /tmp mkdir -p fusemnt passthrough_fh -o max_pages=$1 /tmp/fusemnt grep fuse /proc/self/mounts dd conv=fdatasync oflag=dsync if=/dev/zero of=fusemnt/tmp/tmp \ count=1K bs=1M 2>&1 | grep -v records rm fusemnt/tmp/tmp killall passthrough_fh } Test results: passthrough_fh /tmp/fusemnt fuse.passthrough_fh \ rw,nosuid,nodev,relatime,user_id=0,group_id=0 0 0 1073741824 bytes (1.1 GB) copied, 1.73867 s, 618 MB/s passthrough_fh /tmp/fusemnt fuse.passthrough_fh \ rw,nosuid,nodev,relatime,user_id=0,group_id=0,max_pages=256 0 0 1073741824 bytes (1.1 GB) copied, 1.15643 s, 928 MB/s Obviously with bigger lag the difference between 'before' and 'after' will be more significant. Mitsuo Hayasaka, in 2012 (https://lkml.org/lkml/2012/7/5/136), observed improvement from 400-550 to 520-740. Signed-off-by: Constantine Shulyupin <const@MakeLinux.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-09-06 12:37:06 +00:00
uint16_t max_pages;
uint16_t map_alignment;
uint32_t flags2;
uint32_t max_stack_depth;
uint32_t unused[6];
};
#define CUSE_INIT_INFO_MAX 4096
struct cuse_init_in {
uint32_t major;
uint32_t minor;
uint32_t unused;
uint32_t flags;
};
struct cuse_init_out {
uint32_t major;
uint32_t minor;
uint32_t unused;
uint32_t flags;
uint32_t max_read;
uint32_t max_write;
uint32_t dev_major; /* chardev major */
uint32_t dev_minor; /* chardev minor */
uint32_t spare[10];
};
struct fuse_interrupt_in {
uint64_t unique;
};
struct fuse_bmap_in {
uint64_t block;
uint32_t blocksize;
uint32_t padding;
};
struct fuse_bmap_out {
uint64_t block;
};
struct fuse_ioctl_in {
uint64_t fh;
uint32_t flags;
uint32_t cmd;
uint64_t arg;
uint32_t in_size;
uint32_t out_size;
};
struct fuse_ioctl_iovec {
uint64_t base;
uint64_t len;
};
struct fuse_ioctl_out {
int32_t result;
uint32_t flags;
uint32_t in_iovs;
uint32_t out_iovs;
};
struct fuse_poll_in {
uint64_t fh;
uint64_t kh;
uint32_t flags;
uint32_t events;
};
struct fuse_poll_out {
uint32_t revents;
uint32_t padding;
};
struct fuse_notify_poll_wakeup_out {
uint64_t kh;
};
struct fuse_fallocate_in {
uint64_t fh;
uint64_t offset;
uint64_t length;
uint32_t mode;
uint32_t padding;
};
/**
* FUSE request unique ID flag
*
* Indicates whether this is a resend request. The receiver should handle this
* request accordingly.
*/
#define FUSE_UNIQUE_RESEND (1ULL << 63)
struct fuse_in_header {
uint32_t len;
uint32_t opcode;
uint64_t unique;
uint64_t nodeid;
uint32_t uid;
uint32_t gid;
uint32_t pid;
uint16_t total_extlen; /* length of extensions in 8byte units */
uint16_t padding;
};
struct fuse_out_header {
uint32_t len;
int32_t error;
uint64_t unique;
};
struct fuse_dirent {
uint64_t ino;
uint64_t off;
uint32_t namelen;
uint32_t type;
char name[];
};
fuse: send security context of inode on file When a new inode is created, send its security context to server along with creation request (FUSE_CREAT, FUSE_MKNOD, FUSE_MKDIR and FUSE_SYMLINK). This gives server an opportunity to create new file and set security context (possibly atomically). In all the configurations it might not be possible to set context atomically. Like nfs and ceph, use security_dentry_init_security() to dermine security context of inode and send it with create, mkdir, mknod, and symlink requests. Following is the information sent to server. fuse_sectx_header, fuse_secctx, xattr_name, security_context - struct fuse_secctx_header This contains total number of security contexts being sent and total size of all the security contexts (including size of fuse_secctx_header). - struct fuse_secctx This contains size of security context which follows this structure. There is one fuse_secctx instance per security context. - xattr name string This string represents name of xattr which should be used while setting security context. - security context This is the actual security context whose size is specified in fuse_secctx struct. Also add the FUSE_SECURITY_CTX flag for the `flags` field of the fuse_init_out struct. When this flag is set the kernel will append the security context for a newly created inode to the request (create, mkdir, mknod, and symlink). The server is responsible for ensuring that the inode appears atomically (preferrably) with the requested security context. For example, If the server is using SELinux and backed by a "real" linux file system that supports extended attributes it can write the security context value to /proc/thread-self/attr/fscreate before making the syscall to create the inode. This patch is based on patch from Chirantan Ekbote <chirantan@chromium.org> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-11-11 14:32:49 +00:00
/* Align variable length records to 64bit boundary */
#define FUSE_REC_ALIGN(x) \
(((x) + sizeof(uint64_t) - 1) & ~(sizeof(uint64_t) - 1))
fuse: send security context of inode on file When a new inode is created, send its security context to server along with creation request (FUSE_CREAT, FUSE_MKNOD, FUSE_MKDIR and FUSE_SYMLINK). This gives server an opportunity to create new file and set security context (possibly atomically). In all the configurations it might not be possible to set context atomically. Like nfs and ceph, use security_dentry_init_security() to dermine security context of inode and send it with create, mkdir, mknod, and symlink requests. Following is the information sent to server. fuse_sectx_header, fuse_secctx, xattr_name, security_context - struct fuse_secctx_header This contains total number of security contexts being sent and total size of all the security contexts (including size of fuse_secctx_header). - struct fuse_secctx This contains size of security context which follows this structure. There is one fuse_secctx instance per security context. - xattr name string This string represents name of xattr which should be used while setting security context. - security context This is the actual security context whose size is specified in fuse_secctx struct. Also add the FUSE_SECURITY_CTX flag for the `flags` field of the fuse_init_out struct. When this flag is set the kernel will append the security context for a newly created inode to the request (create, mkdir, mknod, and symlink). The server is responsible for ensuring that the inode appears atomically (preferrably) with the requested security context. For example, If the server is using SELinux and backed by a "real" linux file system that supports extended attributes it can write the security context value to /proc/thread-self/attr/fscreate before making the syscall to create the inode. This patch is based on patch from Chirantan Ekbote <chirantan@chromium.org> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-11-11 14:32:49 +00:00
#define FUSE_NAME_OFFSET offsetof(struct fuse_dirent, name)
#define FUSE_DIRENT_ALIGN(x) FUSE_REC_ALIGN(x)
#define FUSE_DIRENT_SIZE(d) \
FUSE_DIRENT_ALIGN(FUSE_NAME_OFFSET + (d)->namelen)
struct fuse_direntplus {
struct fuse_entry_out entry_out;
struct fuse_dirent dirent;
};
#define FUSE_NAME_OFFSET_DIRENTPLUS \
offsetof(struct fuse_direntplus, dirent.name)
#define FUSE_DIRENTPLUS_SIZE(d) \
FUSE_DIRENT_ALIGN(FUSE_NAME_OFFSET_DIRENTPLUS + (d)->dirent.namelen)
struct fuse_notify_inval_inode_out {
uint64_t ino;
int64_t off;
int64_t len;
};
struct fuse_notify_inval_entry_out {
uint64_t parent;
uint32_t namelen;
uint32_t flags;
};
FUSE: Notifying the kernel of deletion. Allows a FUSE file-system to tell the kernel when a file or directory is deleted. If the specified dentry has the specified inode number, the kernel will unhash it. The current 'fuse_notify_inval_entry' does not cause the kernel to clean up directories that are in use properly, and as a result the users of those directories see incorrect semantics from the file-system. The error condition seen when 'fuse_notify_inval_entry' is used to notify of a deleted directory is avoided when 'fuse_notify_delete' is used instead. The following scenario demonstrates the difference: 1. User A chdirs into 'testdir' and starts reading 'testfile'. 2. User B rm -rf 'testdir'. 3. User B creates 'testdir'. 4. User C chdirs into 'testdir'. If you run the above within the same machine on any file-system (including fuse file-systems), there is no problem: user C is able to chdir into the new testdir. The old testdir is removed from the dentry tree, but still open by user A. If operations 2 and 3 are performed via the network such that the fuse file-system uses one of the notify functions to tell the kernel that the nodes are gone, then the following error occurs for user C while user A holds the original directory open: muirj@empacher:~> ls /test/testdir ls: cannot access /test/testdir: No such file or directory The issue here is that the kernel still has a dentry for testdir, and so it is requesting the attributes for the old directory, while the file-system is responding that the directory no longer exists. If on the other hand, if the file-system can notify the kernel that the directory is deleted using the new 'fuse_notify_delete' function, then the above ls will find the new directory as expected. Signed-off-by: John Muir <john@jmuir.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2011-12-06 20:50:06 +00:00
struct fuse_notify_delete_out {
uint64_t parent;
uint64_t child;
uint32_t namelen;
uint32_t padding;
FUSE: Notifying the kernel of deletion. Allows a FUSE file-system to tell the kernel when a file or directory is deleted. If the specified dentry has the specified inode number, the kernel will unhash it. The current 'fuse_notify_inval_entry' does not cause the kernel to clean up directories that are in use properly, and as a result the users of those directories see incorrect semantics from the file-system. The error condition seen when 'fuse_notify_inval_entry' is used to notify of a deleted directory is avoided when 'fuse_notify_delete' is used instead. The following scenario demonstrates the difference: 1. User A chdirs into 'testdir' and starts reading 'testfile'. 2. User B rm -rf 'testdir'. 3. User B creates 'testdir'. 4. User C chdirs into 'testdir'. If you run the above within the same machine on any file-system (including fuse file-systems), there is no problem: user C is able to chdir into the new testdir. The old testdir is removed from the dentry tree, but still open by user A. If operations 2 and 3 are performed via the network such that the fuse file-system uses one of the notify functions to tell the kernel that the nodes are gone, then the following error occurs for user C while user A holds the original directory open: muirj@empacher:~> ls /test/testdir ls: cannot access /test/testdir: No such file or directory The issue here is that the kernel still has a dentry for testdir, and so it is requesting the attributes for the old directory, while the file-system is responding that the directory no longer exists. If on the other hand, if the file-system can notify the kernel that the directory is deleted using the new 'fuse_notify_delete' function, then the above ls will find the new directory as expected. Signed-off-by: John Muir <john@jmuir.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2011-12-06 20:50:06 +00:00
};
struct fuse_notify_store_out {
uint64_t nodeid;
uint64_t offset;
uint32_t size;
uint32_t padding;
};
struct fuse_notify_retrieve_out {
uint64_t notify_unique;
uint64_t nodeid;
uint64_t offset;
uint32_t size;
uint32_t padding;
};
/* Matches the size of fuse_write_in */
struct fuse_notify_retrieve_in {
uint64_t dummy1;
uint64_t offset;
uint32_t size;
uint32_t dummy2;
uint64_t dummy3;
uint64_t dummy4;
};
struct fuse_backing_map {
int32_t fd;
uint32_t flags;
uint64_t padding;
};
/* Device ioctls: */
#define FUSE_DEV_IOC_MAGIC 229
#define FUSE_DEV_IOC_CLONE _IOR(FUSE_DEV_IOC_MAGIC, 0, uint32_t)
#define FUSE_DEV_IOC_BACKING_OPEN _IOW(FUSE_DEV_IOC_MAGIC, 1, \
struct fuse_backing_map)
#define FUSE_DEV_IOC_BACKING_CLOSE _IOW(FUSE_DEV_IOC_MAGIC, 2, uint32_t)
struct fuse_lseek_in {
uint64_t fh;
uint64_t offset;
uint32_t whence;
uint32_t padding;
};
struct fuse_lseek_out {
uint64_t offset;
};
struct fuse_copy_file_range_in {
uint64_t fh_in;
uint64_t off_in;
uint64_t nodeid_out;
uint64_t fh_out;
uint64_t off_out;
uint64_t len;
uint64_t flags;
};
#define FUSE_SETUPMAPPING_FLAG_WRITE (1ull << 0)
#define FUSE_SETUPMAPPING_FLAG_READ (1ull << 1)
struct fuse_setupmapping_in {
/* An already open handle */
uint64_t fh;
/* Offset into the file to start the mapping */
uint64_t foffset;
/* Length of mapping required */
uint64_t len;
/* Flags, FUSE_SETUPMAPPING_FLAG_* */
uint64_t flags;
/* Offset in Memory Window */
uint64_t moffset;
};
struct fuse_removemapping_in {
/* number of fuse_removemapping_one follows */
uint32_t count;
};
struct fuse_removemapping_one {
/* Offset into the dax window start the unmapping */
uint64_t moffset;
/* Length of mapping required */
uint64_t len;
};
#define FUSE_REMOVEMAPPING_MAX_ENTRY \
(PAGE_SIZE / sizeof(struct fuse_removemapping_one))
virtiofs: propagate sync() to file server Even if POSIX doesn't mandate it, linux users legitimately expect sync() to flush all data and metadata to physical storage when it is located on the same system. This isn't happening with virtiofs though: sync() inside the guest returns right away even though data still needs to be flushed from the host page cache. This is easily demonstrated by doing the following in the guest: $ dd if=/dev/zero of=/mnt/foo bs=1M count=5K ; strace -T -e sync sync 5120+0 records in 5120+0 records out 5368709120 bytes (5.4 GB, 5.0 GiB) copied, 5.22224 s, 1.0 GB/s sync() = 0 <0.024068> and start the following in the host when the 'dd' command completes in the guest: $ strace -T -e fsync /usr/bin/sync virtiofs/foo fsync(3) = 0 <10.371640> There are no good reasons not to honor the expected behavior of sync() actually: it gives an unrealistic impression that virtiofs is super fast and that data has safely landed on HW, which isn't the case obviously. Implement a ->sync_fs() superblock operation that sends a new FUSE_SYNCFS request type for this purpose. Provision a 64-bit placeholder for possible future extensions. Since the file server cannot handle the wait == 0 case, we skip it to avoid a gratuitous roundtrip. Note that this is per-superblock: a FUSE_SYNCFS is send for the root mount and for each submount. Like with FUSE_FSYNC and FUSE_FSYNCDIR, lack of support for FUSE_SYNCFS in the file server is treated as permanent success. This ensures compatibility with older file servers: the client will get the current behavior of sync() not being propagated to the file server. Note that such an operation allows the file server to DoS sync(). Since a typical FUSE file server is an untrusted piece of software running in userspace, this is disabled by default. Only enable it with virtiofs for now since virtiofsd is supposedly trusted by the guest kernel. Reported-by: Robert Krawitz <rlk@redhat.com> Signed-off-by: Greg Kurz <groug@kaod.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-05-20 15:46:54 +00:00
struct fuse_syncfs_in {
uint64_t padding;
};
fuse: send security context of inode on file When a new inode is created, send its security context to server along with creation request (FUSE_CREAT, FUSE_MKNOD, FUSE_MKDIR and FUSE_SYMLINK). This gives server an opportunity to create new file and set security context (possibly atomically). In all the configurations it might not be possible to set context atomically. Like nfs and ceph, use security_dentry_init_security() to dermine security context of inode and send it with create, mkdir, mknod, and symlink requests. Following is the information sent to server. fuse_sectx_header, fuse_secctx, xattr_name, security_context - struct fuse_secctx_header This contains total number of security contexts being sent and total size of all the security contexts (including size of fuse_secctx_header). - struct fuse_secctx This contains size of security context which follows this structure. There is one fuse_secctx instance per security context. - xattr name string This string represents name of xattr which should be used while setting security context. - security context This is the actual security context whose size is specified in fuse_secctx struct. Also add the FUSE_SECURITY_CTX flag for the `flags` field of the fuse_init_out struct. When this flag is set the kernel will append the security context for a newly created inode to the request (create, mkdir, mknod, and symlink). The server is responsible for ensuring that the inode appears atomically (preferrably) with the requested security context. For example, If the server is using SELinux and backed by a "real" linux file system that supports extended attributes it can write the security context value to /proc/thread-self/attr/fscreate before making the syscall to create the inode. This patch is based on patch from Chirantan Ekbote <chirantan@chromium.org> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-11-11 14:32:49 +00:00
/*
* For each security context, send fuse_secctx with size of security context
* fuse_secctx will be followed by security context name and this in turn
* will be followed by actual context label.
* fuse_secctx, name, context
*/
struct fuse_secctx {
uint32_t size;
uint32_t padding;
};
/*
* Contains the information about how many fuse_secctx structures are being
* sent and what's the total size of all security contexts (including
* size of fuse_secctx_header).
*
*/
struct fuse_secctx_header {
uint32_t size;
uint32_t nr_secctx;
};
/**
* struct fuse_ext_header - extension header
* @size: total size of this extension including this header
* @type: type of extension
*
* This is made compatible with fuse_secctx_header by using type values >
* FUSE_MAX_NR_SECCTX
*/
struct fuse_ext_header {
uint32_t size;
uint32_t type;
};
/**
* struct fuse_supp_groups - Supplementary group extension
* @nr_groups: number of supplementary groups
* @groups: flexible array of group IDs
*/
struct fuse_supp_groups {
uint32_t nr_groups;
uint32_t groups[];
};
#endif /* _LINUX_FUSE_H */