Commit Graph

690 Commits

Author SHA1 Message Date
Amir Goldstein 55bf882c7f fanotify: fix merging marks masks with FAN_ONDIR
Change the logic of FAN_ONDIR in two ways that are similar to the logic
of FAN_EVENT_ON_CHILD, that was fixed in commit 54a307ba8d ("fanotify:
fix logic of events on child"):

1. The flag is meaningless in ignore mask
2. The flag refers only to events in the mask of the mark where it is set

This is what the fanotify_mark.2 man page says about FAN_ONDIR:
"Without this flag, only events for files are created."  It doesn't
say anything about setting this flag in ignore mask to stop getting
events on directories nor can I think of any setup where this capability
would be useful.

Currently, when marks masks are merged, the FAN_ONDIR flag set in one
mark affects the events that are set in another mark's mask and this
behavior causes unexpected results.  For example, a user adds a mark on a
directory with mask FAN_ATTRIB | FAN_ONDIR and a mount mark with mask
FAN_OPEN (without FAN_ONDIR).  An opendir() of that directory (which is
inside that mount) generates a FAN_OPEN event even though neither of the
marks requested to get open events on directories.

Link: https://lore.kernel.org/r/20200319151022.31456-10-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-03-24 12:06:32 +01:00
Amir Goldstein f367a62a7c fanotify: merge duplicate events on parent and child
With inotify, when a watch is set on a directory and on its child, an
event on the child is reported twice, once with wd of the parent watch
and once with wd of the child watch without the filename.

With fanotify, when a watch is set on a directory and on its child, an
event on the child is reported twice, but it has the exact same
information - either an open file descriptor of the child or an encoded
fid of the child.

The reason that the two identical events are not merged is because the
object id used for merging events in the queue is the child inode in one
event and parent inode in the other.

For events with path or dentry data, use the victim inode instead of the
watched inode as the object id for event merging, so that the event
reported on parent will be merged with the event reported on the child.

Link: https://lore.kernel.org/r/20200319151022.31456-9-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-03-24 11:29:12 +01:00
Amir Goldstein dfc2d2594e fsnotify: replace inode pointer with an object id
The event inode field is used only for comparison in queue merges and
cannot be dereferenced after handle_event(), because it does not hold a
refcount on the inode.

Replace it with an abstract id to do the same thing.

Link: https://lore.kernel.org/r/20200319151022.31456-8-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-03-24 11:28:00 +01:00
Amir Goldstein 017de65fe5 fsnotify: simplify arguments passing to fsnotify_parent()
Instead of passing both dentry and path and having to figure out which
one to use, pass data/data_type to simplify the code.

Link: https://lore.kernel.org/r/20200319151022.31456-6-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-03-23 18:22:48 +01:00
Amir Goldstein aa93bdc550 fsnotify: use helpers to access data by data_type
Create helpers to access path and inode from different data types.

Link: https://lore.kernel.org/r/20200319151022.31456-5-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-03-23 18:19:06 +01:00
Eric Sandeen 1edc8eb2e9 fs: call fsnotify_sb_delete after evict_inodes
When a filesystem is unmounted, we currently call fsnotify_sb_delete()
before evict_inodes(), which means that fsnotify_unmount_inodes()
must iterate over all inodes on the superblock looking for any inodes
with watches.  This is inefficient and can lead to livelocks as it
iterates over many unwatched inodes.

At this point, SB_ACTIVE is gone and dropping refcount to zero kicks
the inode out out immediately, so anything processed by
fsnotify_sb_delete / fsnotify_unmount_inodes gets evicted in that loop.

After that, the call to evict_inodes will evict everything else with a
zero refcount.

This should speed things up overall, and avoid livelocks in
fsnotify_unmount_inodes().

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-12-18 00:03:01 -05:00
Eric Sandeen 04646aebd3 fs: avoid softlockups in s_inodes iterators
Anything that walks all inodes on sb->s_inodes list without rescheduling
risks softlockups.

Previous efforts were made in 2 functions, see:

c27d82f fs/drop_caches.c: avoid softlockups in drop_pagecache_sb()
ac05fbb inode: don't softlockup when evicting inodes

but there hasn't been an audit of all walkers, so do that now.  This
also consistently moves the cond_resched() calls to the bottom of each
loop in cases where it already exists.

One loop remains: remove_dquot_ref(), because I'm not quite sure how
to deal with that one w/o taking the i_lock.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-12-18 00:03:01 -05:00
Linus Torvalds 0da522107e compat_ioctl: remove most of fs/compat_ioctl.c
As part of the cleanup of some remaining y2038 issues, I came to
 fs/compat_ioctl.c, which still has a couple of commands that need support
 for time64_t.
 
 In completely unrelated work, I spent time on cleaning up parts of this
 file in the past, moving things out into drivers instead.
 
 After Al Viro reviewed an earlier version of this series and did a lot
 more of that cleanup, I decided to try to completely eliminate the rest
 of it and move it all into drivers.
 
 This series incorporates some of Al's work and many patches of my own,
 but in the end stops short of actually removing the last part, which is
 the scsi ioctl handlers. I have patches for those as well, but they need
 more testing or possibly a rewrite.
 
 Signed-off-by: Arnd Bergmann <arnd@arndb.de>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJdsHCdAAoJEJpsee/mABjZtYkP/1JGl3jFv3Iq/5BCdPkaePP1
 RtMJRNfURgK3GeuHUui330PvVjI/pLWXU/VXMK2MPTASpJLzYz3uCaZrpVWEMpDZ
 +ImzGmgJkITlW1uWU3zOcQhOxTyb1hCZ0Ci+2xn9QAmyOL7prXoXCXDWv3h6iyiF
 lwG+nW+HNtyx41YG+9bRfKNoG0ZJ+nkJ70BV6u0acQHXWn7Xuupa9YUmBL87hxAL
 6dlJfLTJg6q8QSv/Q6LxslfWk2Ti8OOJZOwtFM5R8Bgl0iUcvshiRCKfv/3t9jXD
 dJNvF1uq8z+gracWK49Qsfq5dnZ2ZxHFUo9u0NjbCrxNvWH/sdvhbaUBuJI75seH
 VIznCkdxFhrqitJJ8KmxANxG08u+9zSKjSlxG2SmlA4qFx/AoStoHwQXcogJscNb
 YIXYKmWBvwPzYu09QFAXdHFPmZvp/3HhMWU6o92lvDhsDwzkSGt3XKhCJea4DCaT
 m+oCcoACqSWhMwdbJOEFofSub4bY43s5iaYuKes+c8O261/Dwg6v/pgIVez9mxXm
 TBnvCsotq5m8wbwzv99eFqGeJH8zpDHrXxEtRR5KQqMqjLq/OQVaEzmpHZTEuK7n
 e/V/PAKo2/V63g4k6GApQXDxnjwT+m0aWToWoeEzPYXS6KmtWC91r4bWtslu3rdl
 bN65armTm7bFFR32Avnu
 =lgCl
 -----END PGP SIGNATURE-----

Merge tag 'compat-ioctl-5.5' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground

Pull removal of most of fs/compat_ioctl.c from Arnd Bergmann:
 "As part of the cleanup of some remaining y2038 issues, I came to
  fs/compat_ioctl.c, which still has a couple of commands that need
  support for time64_t.

  In completely unrelated work, I spent time on cleaning up parts of
  this file in the past, moving things out into drivers instead.

  After Al Viro reviewed an earlier version of this series and did a lot
  more of that cleanup, I decided to try to completely eliminate the
  rest of it and move it all into drivers.

  This series incorporates some of Al's work and many patches of my own,
  but in the end stops short of actually removing the last part, which
  is the scsi ioctl handlers. I have patches for those as well, but they
  need more testing or possibly a rewrite"

* tag 'compat-ioctl-5.5' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground: (42 commits)
  scsi: sd: enable compat ioctls for sed-opal
  pktcdvd: add compat_ioctl handler
  compat_ioctl: move SG_GET_REQUEST_TABLE handling
  compat_ioctl: ppp: move simple commands into ppp_generic.c
  compat_ioctl: handle PPPIOCGIDLE for 64-bit time_t
  compat_ioctl: move PPPIOCSCOMPRESS to ppp_generic
  compat_ioctl: unify copy-in of ppp filters
  tty: handle compat PPP ioctls
  compat_ioctl: move SIOCOUTQ out of compat_ioctl.c
  compat_ioctl: handle SIOCOUTQNSD
  af_unix: add compat_ioctl support
  compat_ioctl: reimplement SG_IO handling
  compat_ioctl: move WDIOC handling into wdt drivers
  fs: compat_ioctl: move FITRIM emulation into file systems
  gfs2: add compat_ioctl support
  compat_ioctl: remove unused convert_in_user macro
  compat_ioctl: remove last RAID handling code
  compat_ioctl: remove /dev/raw ioctl translation
  compat_ioctl: remove PCI ioctl translation
  compat_ioctl: remove joystick ioctl translation
  ...
2019-12-01 13:46:15 -08:00
Arnd Bergmann 1832f2d8ff compat_ioctl: move more drivers to compat_ptr_ioctl
The .ioctl and .compat_ioctl file operations have the same prototype so
they can both point to the same function, which works great almost all
the time when all the commands are compatible.

One exception is the s390 architecture, where a compat pointer is only
31 bit wide, and converting it into a 64-bit pointer requires calling
compat_ptr(). Most drivers here will never run in s390, but since we now
have a generic helper for it, it's easy enough to use it consistently.

I double-checked all these drivers to ensure that all ioctl arguments
are used as pointers or are ignored, but are not interpreted as integer
values.

Acked-by: Jason Gunthorpe <jgg@mellanox.com>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Acked-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: David Sterba <dsterba@suse.com>
Acked-by: Darren Hart (VMware) <dvhart@infradead.org>
Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Acked-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-10-23 17:23:44 +02:00
Ben Dooks (Codethink) ddd06c36bd fsnotify/fdinfo: exportfs_encode_inode_fh() takes pointer as 4th argument
The call to exportfs_encode_inode_fh() takes an pointer
as the 4th argument, so replace the integer 0 with the
NULL pointer.

This fixes the following sparse warning:

fs/notify/fdinfo.c:53:87: warning: Using plain integer as NULL pointer

Link: https://lore.kernel.org/r/20191016095955.3347-1-ben.dooks@codethink.co.uk
Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-10-17 10:32:59 +02:00
Ben Dooks 4a0b20be60 fsnotify: move declaration of fsnotify_mark_connector_cachep to fsnotify.h
Move fsnotify_mark_connector_cachep to fsnotify.h to properly
share it with the user in mark.c and avoid the following warning
from sparse:

fs/notify/mark.c:82:19: warning: symbol 'fsnotify_mark_connector_cachep' was not declared. Should it be static?

Link: https://lore.kernel.org/r/20191015132518.21819-1-ben.dooks@codethink.co.uk
Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-10-17 10:31:12 +02:00
Linus Torvalds 298fb76a55 Highlights:
- add a new knfsd file cache, so that we don't have to open and
 	  close on each (NFSv2/v3) READ or WRITE.  This can speed up
 	  read and write in some cases.  It also replaces our readahead
 	  cache.
 	- Prevent silent data loss on write errors, by treating write
 	  errors like server reboots for the purposes of write caching,
 	  thus forcing clients to resend their writes.
 	- Tweak the code that allocates sessions to be more forgiving,
 	  so that NFSv4.1 mounts are less likely to hang when a server
 	  already has a lot of clients.
 	- Eliminate an arbitrary limit on NFSv4 ACL sizes; they should
 	  now be limited only by the backend filesystem and the
 	  maximum RPC size.
 	- Allow the server to enforce use of the correct kerberos
 	  credentials when a client reclaims state after a reboot.
 
 And some miscellaneous smaller bugfixes and cleanup.
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCAAzFiEEYtFWavXG9hZotryuJ5vNeUKO4b4FAl2OoFcVHGJmaWVsZHNA
 ZmllbGRzZXMub3JnAAoJECebzXlCjuG+dRoP/3OW1NxPjpjbCQWZL0M+O3AYJJla
 W8E+uoZKMosFEe/ymokMD0Vn5s47jPaMCifMjHZa2GygW8zHN9X2v0HURx/lob+o
 /rJXwMn78N/8kdbfDz2FvaCPeT0IuNzRIFBV8/sSXofqwCBwvPo+cl0QGrd4/xLp
 X35qlupx62TRk+kbdRjvv8kpS5SJ7BvR+FSA1WubNYWw2hpdEsr2OCFdGq2Wvthy
 DK6AfGBXfJGsOE+HGCSj6ejRV6i0UOJ17P8gRSsx+YT0DOe5E7ROjt+qvvRwk489
 wmR8Vjuqr1e40eGAUq3xuLfk5F5NgycY4ekVxk/cTVFNwWcz2DfdjXQUlyPAbrSD
 SqIyxN1qdKT24gtr7AHOXUWJzBYPWDgObCVBXUGzyL81RiDdhf38HRNjL2TcSDld
 tzCjQ0wbPw+iT74v6qQRY05oS+h3JOtDjU4pxsBnxVtNn4WhGJtaLfWW8o1C1QwU
 bc4aX3TlYhDmzU7n7Zjt4rFXGJfyokM+o6tPao1Z60Pmsv1gOk4KQlzLtW/jPHx4
 ZwYTwVQUKRDBfC62nmgqDyGI3/Qu11FuIxL2KXUCgkwDxNWN4YkwYjOGw9Lb5qKM
 wFpxq6CDNZB/IWLEu8Yg85kDPPUJMoI657lZb7Osr/MfBpU0YljcMOIzLBy8uV1u
 9COUbPaQipiWGu/0
 =diBo
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-5.4' of git://linux-nfs.org/~bfields/linux

Pull nfsd updates from Bruce Fields:
 "Highlights:

   - Add a new knfsd file cache, so that we don't have to open and close
     on each (NFSv2/v3) READ or WRITE. This can speed up read and write
     in some cases. It also replaces our readahead cache.

   - Prevent silent data loss on write errors, by treating write errors
     like server reboots for the purposes of write caching, thus forcing
     clients to resend their writes.

   - Tweak the code that allocates sessions to be more forgiving, so
     that NFSv4.1 mounts are less likely to hang when a server already
     has a lot of clients.

   - Eliminate an arbitrary limit on NFSv4 ACL sizes; they should now be
     limited only by the backend filesystem and the maximum RPC size.

   - Allow the server to enforce use of the correct kerberos credentials
     when a client reclaims state after a reboot.

  And some miscellaneous smaller bugfixes and cleanup"

* tag 'nfsd-5.4' of git://linux-nfs.org/~bfields/linux: (34 commits)
  sunrpc: clean up indentation issue
  nfsd: fix nfs read eof detection
  nfsd: Make nfsd_reset_boot_verifier_locked static
  nfsd: degraded slot-count more gracefully as allocation nears exhaustion.
  nfsd: handle drc over-allocation gracefully.
  nfsd: add support for upcall version 2
  nfsd: add a "GetVersion" upcall for nfsdcld
  nfsd: Reset the boot verifier on all write I/O errors
  nfsd: Don't garbage collect files that might contain write errors
  nfsd: Support the server resetting the boot verifier
  nfsd: nfsd_file cache entries should be per net namespace
  nfsd: eliminate an unnecessary acl size limit
  Deprecate nfsd fault injection
  nfsd: remove duplicated include from filecache.c
  nfsd: Fix the documentation for svcxdr_tmpalloc()
  nfsd: Fix up some unused variable warnings
  nfsd: close cached files prior to a REMOVE or RENAME that would replace target
  nfsd: rip out the raparms cache
  nfsd: have nfsd_test_lock use the nfsd_file cache
  nfsd: hook up nfs4_preprocess_stateid_op to the nfsd_file cache
  ...
2019-09-27 17:00:27 -07:00
Linus Torvalds 5825a95fe9 selinux/stable-5.4 PR 20190917
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAl2BLvcUHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXP9pA/+Ls9sRGZoEipycbgRnwkL9/6yFtn4
 UCFGMP0eobrjL82i8uMOa/72Budsp3ZaZRxf36NpbMDPyB9ohp5jf7o1WFTELESv
 EwxVvOMNwrxO2UbzRv3iywnhdPVJ4gHPa4GWfBHu2EEfhz3/Bv0tPIBdeXAbq4aC
 R0p+M9X0FFEp9eP4ftwOvFGpbZ8zKo1kwgdvCnqLhHDkyqtapqO/ByCTe1VATERP
 fyxjYDZNnITmI0plaIxCeeudklOTtVSAL4JPh1rk8rZIkUznZ4EBDHxdKiaz3j9C
 ZtAthiAA9PfAwf4DZSPHnGsfINxeNBKLD65jZn/PUne/gNJEx4DK041X9HXBNwjv
 OoArw58LCzxtTNZ//WB4CovRpeSdKvmKv0oh61k8cdQahLeHhzXE1wLQbnnBJLI3
 CTsumIp4ZPEOX5r4ogdS3UIQpo3KrZump7VO85yUTRni150JpZR3egYpmcJ0So1A
 QTPemBhC2CHJVTpycYZ9fVTlPeC4oNwosPmvpB8XeGu3w5JpuNSId+BDR/ZlQAmq
 xWiIocGL3UMuPuJUrTGChifqBAgzK+gLa7S7RYPEnTCkj6LVQwsuP4gBXf75QTG4
 FPwVcoMSDFxUDF0oFqwz4GfJlCxBSzX+BkWUn6jIiXKXBnQjU+1gu6KTwE25mf/j
 snJznFk25hFYFaM=
 =n4ht
 -----END PGP SIGNATURE-----

Merge tag 'selinux-pr-20190917' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux

Pull selinux updates from Paul Moore:

 - Add LSM hooks, and SELinux access control hooks, for dnotify,
   fanotify, and inotify watches. This has been discussed with both the
   LSM and fs/notify folks and everybody is good with these new hooks.

 - The LSM stacking changes missed a few calls to current_security() in
   the SELinux code; we fix those and remove current_security() for
   good.

 - Improve our network object labeling cache so that we always return
   the object's label, even when under memory pressure. Previously we
   would return an error if we couldn't allocate a new cache entry, now
   we always return the label even if we can't create a new cache entry
   for it.

 - Convert the sidtab atomic_t counter to a normal u32 with
   READ/WRITE_ONCE() and memory barrier protection.

 - A few patches to policydb.c to clean things up (remove forward
   declarations, long lines, bad variable names, etc)

* tag 'selinux-pr-20190917' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
  lsm: remove current_security()
  selinux: fix residual uses of current_security() for the SELinux blob
  selinux: avoid atomic_t usage in sidtab
  fanotify, inotify, dnotify, security: add security hook for fs notifications
  selinux: always return a secid from the network caches if we find one
  selinux: policydb - rename type_val_to_struct_array
  selinux: policydb - fix some checkpatch.pl warnings
  selinux: shuffle around policydb.c to get rid of forward declarations
2019-09-23 11:21:04 -07:00
Trond Myklebust b72679ee89 notify: export symbols for use by the knfsd file cache
The knfsd file cache will need to detect when files are unlinked, so that
it can close the associated cached files. Export a minimal set of notifier
functions to allow it to do so.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-08-19 11:00:39 -04:00
Aaron Goidel ac5656d8a4 fanotify, inotify, dnotify, security: add security hook for fs notifications
As of now, setting watches on filesystem objects has, at most, applied a
check for read access to the inode, and in the case of fanotify, requires
CAP_SYS_ADMIN. No specific security hook or permission check has been
provided to control the setting of watches. Using any of inotify, dnotify,
or fanotify, it is possible to observe, not only write-like operations, but
even read access to a file. Modeling the watch as being merely a read from
the file is insufficient for the needs of SELinux. This is due to the fact
that read access should not necessarily imply access to information about
when another process reads from a file. Furthermore, fanotify watches grant
more power to an application in the form of permission events. While
notification events are solely, unidirectional (i.e. they only pass
information to the receiving application), permission events are blocking.
Permission events make a request to the receiving application which will
then reply with a decision as to whether or not that action may be
completed. This causes the issue of the watching application having the
ability to exercise control over the triggering process. Without drawing a
distinction within the permission check, the ability to read would imply
the greater ability to control an application. Additionally, mount and
superblock watches apply to all files within the same mount or superblock.
Read access to one file should not necessarily imply the ability to watch
all files accessed within a given mount or superblock.

In order to solve these issues, a new LSM hook is implemented and has been
placed within the system calls for marking filesystem objects with inotify,
fanotify, and dnotify watches. These calls to the hook are placed at the
point at which the target path has been resolved and are provided with the
path struct, the mask of requested notification events, and the type of
object on which the mark is being set (inode, superblock, or mount). The
mask and obj_type have already been translated into common FS_* values
shared by the entirety of the fs notification infrastructure. The path
struct is passed rather than just the inode so that the mount is available,
particularly for mount watches. This also allows for use of the hook by
pathname-based security modules. However, since the hook is intended for
use even by inode based security modules, it is not placed under the
CONFIG_SECURITY_PATH conditional. Otherwise, the inode-based security
modules would need to enable all of the path hooks, even though they do not
use any of them.

This only provides a hook at the point of setting a watch, and presumes
that permission to set a particular watch implies the ability to receive
all notification about that object which match the mask. This is all that
is required for SELinux. If other security modules require additional hooks
or infrastructure to control delivery of notification, these can be added
by them. It does not make sense for us to propose hooks for which we have
no implementation. The understanding that all notifications received by the
requesting application are all strictly of a type for which the application
has been granted permission shows that this implementation is sufficient in
its coverage.

Security modules wishing to provide complete control over fanotify must
also implement a security_file_open hook that validates that the access
requested by the watching application is authorized. Fanotify has the issue
that it returns a file descriptor with the file mode specified during
fanotify_init() to the watching process on event. This is already covered
by the LSM security_file_open hook if the security module implements
checking of the requested file mode there. Otherwise, a watching process
can obtain escalated access to a file for which it has not been authorized.

The selinux_path_notify hook implementation works by adding five new file
permissions: watch, watch_mount, watch_sb, watch_reads, and watch_with_perm
(descriptions about which will follow), and one new filesystem permission:
watch (which is applied to superblock checks). The hook then decides which
subset of these permissions must be held by the requesting application
based on the contents of the provided mask and the obj_type. The
selinux_file_open hook already checks the requested file mode and therefore
ensures that a watching process cannot escalate its access through
fanotify.

The watch, watch_mount, and watch_sb permissions are the baseline
permissions for setting a watch on an object and each are a requirement for
any watch to be set on a file, mount, or superblock respectively. It should
be noted that having either of the other two permissions (watch_reads and
watch_with_perm) does not imply the watch, watch_mount, or watch_sb
permission. Superblock watches further require the filesystem watch
permission to the superblock. As there is no labeled object in view for
mounts, there is no specific check for mount watches beyond watch_mount to
the inode. Such a check could be added in the future, if a suitable labeled
object existed representing the mount.

The watch_reads permission is required to receive notifications from
read-exclusive events on filesystem objects. These events include accessing
a file for the purpose of reading and closing a file which has been opened
read-only. This distinction has been drawn in order to provide a direct
indication in the policy for this otherwise not obvious capability. Read
access to a file should not necessarily imply the ability to observe read
events on a file.

Finally, watch_with_perm only applies to fanotify masks since it is the
only way to set a mask which allows for the blocking, permission event.
This permission is needed for any watch which is of this type. Though
fanotify requires CAP_SYS_ADMIN, this is insufficient as it gives implicit
trust to root, which we do not do, and does not support least privilege.

Signed-off-by: Aaron Goidel <acgoide@tycho.nsa.gov>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2019-08-12 17:45:39 -04:00
Matteo Croce eec4844fae proc/sysctl: add shared variables for range check
In the sysctl code the proc_dointvec_minmax() function is often used to
validate the user supplied value between an allowed range.  This
function uses the extra1 and extra2 members from struct ctl_table as
minimum and maximum allowed value.

On sysctl handler declaration, in every source file there are some
readonly variables containing just an integer which address is assigned
to the extra1 and extra2 members, so the sysctl range is enforced.

The special values 0, 1 and INT_MAX are very often used as range
boundary, leading duplication of variables like zero=0, one=1,
int_max=INT_MAX in different source files:

    $ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
    248

Add a const int array containing the most commonly used values, some
macros to refer more easily to the correct array member, and use them
instead of creating a local one for every object file.

This is the bloat-o-meter output comparing the old and new binary
compiled with the default Fedora config:

    # scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
    add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
    Data                                         old     new   delta
    sysctl_vals                                    -      12     +12
    __kstrtab_sysctl_vals                          -      12     +12
    max                                           14      10      -4
    int_max                                       16       -     -16
    one                                           68       -     -68
    zero                                         128      28    -100
    Total: Before=20583249, After=20583085, chg -0.00%

[mcroce@redhat.com: tipc: remove two unused variables]
  Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
[akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
[arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
  Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
[akpm@linux-foundation.org: fix fs/eventpoll.c]
Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-18 17:08:07 -07:00
Shakeel Butt ec16545096 memcg, fsnotify: no oom-kill for remote memcg charging
Commit d46eb14b73 ("fs: fsnotify: account fsnotify metadata to
kmemcg") added remote memcg charging for fanotify and inotify event
objects.  The aim was to charge the memory to the listener who is
interested in the events but without triggering the OOM killer.
Otherwise there would be security concerns for the listener.

At the time, oom-kill trigger was not in the charging path.  A parallel
work added the oom-kill back to charging path i.e.  commit 29ef680ae7
("memcg, oom: move out_of_memory back to the charge path").  So to not
trigger oom-killer in the remote memcg, explicitly add
__GFP_RETRY_MAYFAIL to the fanotigy and inotify event allocations.

Link: http://lkml.kernel.org/r/20190514212259.156585-2-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Acked-by: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12 11:05:43 -07:00
Linus Torvalds e6983afd92 \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAl0kWgwACgkQnJ2qBz9k
 QNkkdggA7bdy6xZRPdumZMxtASGDs1JJ4diNs+apgyc6wUfsT1lCE2ap20EdzfzK
 drAvlJt1vYEW+6apOzUXJ0qWXMVRzy4XRl+jVMO9GW6BoY4OyJQ86AQZlEv1zZ4n
 vxeYnlbxA7JyfkWgup0ZSb5EKRSO1eSxZKEZou0wu2jRCRr/E5RyjPQHXaiE5ihc
 7ilEtTI3Qg3nnAK30F0Iy0X3lGqgXj+rlJ0TgR8BBEDllct2wV16vvMl/Sy+BXip
 5sSWjSy8zntMnkSN8yH/oJN0D+fqmCsnYafwqTpPek8izvEz4xpjshbWTDnPm0HM
 eiMC1U3ZJoD3Z4/wxRZ91m60VYgJBA==
 =SVKR
 -----END PGP SIGNATURE-----

Merge tag 'fsnotify_for_v5.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull fsnotify updates from Jan Kara:
 "This contains cleanups of the fsnotify name removal hook and also a
  patch to disable fanotify permission events for 'proc' filesystem"

* tag 'fsnotify_for_v5.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  fsnotify: get rid of fsnotify_nameremove()
  fsnotify: move fsnotify_nameremove() hook out of d_delete()
  configfs: call fsnotify_rmdir() hook
  debugfs: call fsnotify_{unlink,rmdir}() hooks
  debugfs: simplify __debugfs_remove_file()
  devpts: call fsnotify_unlink() hook
  tracefs: call fsnotify_{unlink,rmdir}() hooks
  rpc_pipefs: call fsnotify_{unlink,rmdir}() hooks
  btrfs: call fsnotify_rmdir() hook
  fsnotify: add empty fsnotify_{unlink,rmdir}() hooks
  fanotify: Disallow permission events for proc filesystem
2019-07-10 20:09:17 -07:00
Amir Goldstein 7377f5bec1 fsnotify: get rid of fsnotify_nameremove()
For all callers of fsnotify_{unlink,rmdir}(), we made sure that d_parent
and d_name are stable.  Therefore, fsnotify_{unlink,rmdir}() do not need
the safety measures in fsnotify_nameremove() to stabilize parent and name.
We can now simplify those hooks and get rid of fsnotify_nameremove().

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-06-20 14:47:54 +02:00
Amir Goldstein c285a2f01d fanotify: update connector fsid cache on add mark
When implementing connector fsid cache, we only initialized the cache
when the first mark added to object was added by FAN_REPORT_FID group.
We forgot to update conn->fsid when the second mark is added by
FAN_REPORT_FID group to an already attached connector without fsid
cache.

Reported-and-tested-by: syzbot+c277e8e2f46414645508@syzkaller.appspotmail.com
Fixes: 77115225ac ("fanotify: cache fsid in fsnotify_mark_connector")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-06-19 15:53:58 +02:00
Jan Kara 0b3b094ac9 fanotify: Disallow permission events for proc filesystem
Proc filesystem has special locking rules for various files. Thus
fanotify which opens files on event delivery can easily deadlock
against another process that waits for fanotify permission event to be
handled. Since permission events on /proc have doubtful value anyway,
just disallow them.

Link: https://lore.kernel.org/linux-fsdevel/20190320131642.GE9485@quack2.suse.cz/
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-05-28 18:10:07 +02:00
Thomas Gleixner 3e0a4e8580 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 118
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation either version 2 or at your option any
  later version this program is distributed in the hope that it will
  be useful but without any warranty without even the implied warranty
  of merchantability or fitness for a particular purpose see the gnu
  general public license for more details

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 44 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190523091651.032047323@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-24 17:39:02 +02:00
Thomas Gleixner c82ee6d3be treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 18
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation either version 2 or at your option any
  later version this program is distributed in the hope that it will
  be useful but without any warranty without even the implied warranty
  of merchantability or fitness for a particular purpose see the gnu
  general public license for more details you should have received a
  copy of the gnu general public license along with this program see
  the file copying if not write to the free software foundation 675
  mass ave cambridge ma 02139 usa

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 52 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jilayne Lovejoy <opensource@jilayne.com>
Reviewed-by: Steve Winslow <swinslow@gmail.com>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190519154042.342335923@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-21 11:28:46 +02:00
Thomas Gleixner ec8f24b7fa treewide: Add SPDX license identifier - Makefile/Kconfig
Add SPDX license identifiers to all Make/Kconfig files which:

 - Have no license information of any form

These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:

  GPL-2.0-only

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-21 10:50:46 +02:00
Linus Torvalds d4c608115c \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAlzZmHgACgkQnJ2qBz9k
 QNleKQf/VglDfB2VvuYnMbZp4YBPZgNMe0KlOTYE60RBg4DGZj04ov0bxSRZccPP
 KvHp1ZyIM4Bp67lKFZoI6bYd5diq7gVjHIdHrFc0LfDmpyaodqxe3bOMWT/TH0YU
 0jjr7MU1wmVRY8UYrIqPIlL6Dl0hlxGf5cYPzoRQB4hzNWEWr9ZEdC1DbKupWTk/
 1GsYtCHZzpZ20YfzDU3jcYaHRVZDwZXb65Wx3OfUVof8q9bkpHd5lhl79jT4yVPS
 yYfqs1CW6ra2zsYDtBdmh46BZ3d36ACfcjmgf2/7GMdQP8Ocv0c5lnSSkrJeC5XO
 GttTFIKU7JWFAqiWmPAoGIchUXNkRA==
 =eSpi
 -----END PGP SIGNATURE-----

Merge tag 'fsnotify_for_v5.2-rc1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull fsnotify fixes from Jan Kara:
 "Two fsnotify fixes"

* tag 'fsnotify_for_v5.2-rc1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  fsnotify: fix unlink performance regression
  fsnotify: Clarify connector assignment in fsnotify_add_mark_list()
2019-05-13 15:08:16 -07:00
Amir Goldstein 4d8e7055a4 fsnotify: fix unlink performance regression
__fsnotify_parent() has an optimization in place to avoid unneeded
take_dentry_name_snapshot().  When fsnotify_nameremove() was changed
not to call __fsnotify_parent(), we left out the optimization.
Kernel test robot reported a 5% performance regression in concurrent
unlink() workload.

Reported-by: kernel test robot <rong.a.chen@intel.com>
Link: https://lore.kernel.org/lkml/20190505062153.GG29809@shao2-debian/
Link: https://lore.kernel.org/linux-fsdevel/20190104090357.GD22409@quack2.suse.cz/
Fixes: 5f02a87763 ("fsnotify: annotate directory entry modification events")
CC: stable@vger.kernel.org
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-05-09 12:44:00 +02:00
Linus Torvalds d27fb65bc2 Merge branch 'work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc dcache updates from Al Viro:
 "Most of this pile is putting name length into struct name_snapshot and
  making use of it.

  The beginning of this series ("ovl_lookup_real_one(): don't bother
  with strlen()") ought to have been split in two (separate switch of
  name_snapshot to struct qstr from overlayfs reaping the trivial
  benefits of that), but I wanted to avoid a rebase - by the time I'd
  spotted that it was (a) in -next and (b) close to 5.1-final ;-/"

* 'work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  audit_compare_dname_path(): switch to const struct qstr *
  audit_update_watch(): switch to const struct qstr *
  inotify_handle_event(): don't bother with strlen()
  fsnotify: switch send_to_group() and ->handle_event to const struct qstr *
  fsnotify(): switch to passing const struct qstr * for file_name
  switch fsnotify_move() to passing const struct qstr * for old_name
  ovl_lookup_real_one(): don't bother with strlen()
  sysv: bury the broken "quietly truncate the long filenames" logics
  nsfs: unobfuscate
  unexport d_alloc_pseudo()
2019-05-07 20:03:32 -07:00
Linus Torvalds eac7078a0f pidfd patches for v5.2-rc1
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE7btrcuORLb1XUhEwjrBW1T7ssS0FAlzReuoACgkQjrBW1T7s
 sS1uvBAA16pgnhRNxNTrp3LYft6lUWmF4n0baOTVtQNLhPjpwaOxHIrCBugkQCJB
 QcQ9IQSOvIkaEW0XAQoPBaeLviiKhHOFw1Fv89OtW6xUidSfSV15lcI9f1F2pCm2
 4yCL/8XvL6M0NhxiwftJAkWOXeDNLfjFnLwyLxBfgg3EeyqMgUB8raeosEID0ORR
 gm2/g8DYS2r+KNqM/F4xvMSgabfi2bGk+8BtAaVnftJfstpRNrqKwWnSK3Wspj1l
 5gkb8gSsiY6ns3V6RgNHrFlhevFg8V+VjcJt7FR+aUEjOkcoiXas/PhvamMzdsn/
 FM1F/A0pM8FSybIUClhnnnxNPc+p8ZN/71YQAPs+Mnh3xvbtKea2lkhC+Xv4OpK3
 edutSZWFaiIery82Rk00H3vqiSF1+kRIXSpZSS4mElk4FsVljkyH+nSP7rbmE2MR
 EQe+kKnZl8QzWrVbnODC+EVvvVpA2bXDvENJmvKqus+t2G0OdV7Iku3F5E3KjF8k
 S5RRV1zuBF3ugqnjmYrVmJtpEA8mxClmqvg6okru+qW6ngO5oOgVpPLjWn1CXcdj
 wcuQ6Pe1QwAHS54e9WSWgCHVssLvm9nCdCqypdNaoyGWmbTWntwlrY7Y0JUQnAbB
 6/G/DQQiCWY9y8bMZlTEydhIpgcsdROuPYv+oHF5+eQQthsWwHc=
 =LH11
 -----END PGP SIGNATURE-----

Merge tag 'pidfd-v5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull pidfd updates from Christian Brauner:
 "This patchset makes it possible to retrieve pidfds at process creation
  time by introducing the new flag CLONE_PIDFD to the clone() system
  call. Linus originally suggested to implement this as a new flag to
  clone() instead of making it a separate system call.

  After a thorough review from Oleg CLONE_PIDFD returns pidfds in the
  parent_tidptr argument. This means we can give back the associated pid
  and the pidfd at the same time. Access to process metadata information
  thus becomes rather trivial.

  As has been agreed, CLONE_PIDFD creates file descriptors based on
  anonymous inodes similar to the new mount api. They are made
  unconditional by this patchset as they are now needed by core kernel
  code (vfs, pidfd) even more than they already were before (timerfd,
  signalfd, io_uring, epoll etc.). The core patchset is rather small.
  The bulky looking changelist is caused by David's very simple changes
  to Kconfig to make anon inodes unconditional.

  A pidfd comes with additional information in fdinfo if the kernel
  supports procfs. The fdinfo file contains the pid of the process in
  the callers pid namespace in the same format as the procfs status
  file, i.e. "Pid:\t%d".

  To remove worries about missing metadata access this patchset comes
  with a sample/test program that illustrates how a combination of
  CLONE_PIDFD and pidfd_send_signal() can be used to gain race-free
  access to process metadata through /proc/<pid>.

  Further work based on this patchset has been done by Joel. His work
  makes pidfds pollable. It finished too late for this merge window. I
  would prefer to have it sitting in linux-next for a while and send it
  for inclusion during the 5.3 merge window"

* tag 'pidfd-v5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  samples: show race-free pidfd metadata access
  signal: support CLONE_PIDFD with pidfd_send_signal
  clone: add CLONE_PIDFD
  Make anon_inodes unconditional
2019-05-07 12:30:24 -07:00
Jan Kara 11a6f8e2db fsnotify: Clarify connector assignment in fsnotify_add_mark_list()
Add a comment explaining why WRITE_ONCE() is enough when setting
mark->connector which can get dereferenced by RCU protected readers.

Signed-off-by: Jan Kara <jack@suse.cz>
2019-05-01 18:05:11 +02:00
Jan Kara b1da6a5187 fsnotify: Fix NULL ptr deref in fanotify_get_fsid()
fanotify_get_fsid() is reading mark->connector->fsid under srcu. It can
happen that it sees mark not fully initialized or mark that is already
detached from the object list. In these cases mark->connector
can be NULL leading to NULL ptr dereference. Fix the problem by
being careful when reading mark->connector and check it for being NULL.
Also use WRITE_ONCE when writing the mark just to prevent compiler from
doing something stupid.

Reported-by: syzbot+15927486a4f1bfcbaf91@syzkaller.appspotmail.com
Fixes: 77115225ac ("fanotify: cache fsid in fsnotify_mark_connector")
Signed-off-by: Jan Kara <jack@suse.cz>
2019-04-28 22:14:50 +02:00
Al Viro ce163918cd inotify_handle_event(): don't bother with strlen()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-04-26 13:55:21 -04:00
Al Viro e43e9c339a fsnotify: switch send_to_group() and ->handle_event to const struct qstr *
note that conditions surrounding accesses to dname in audit_watch_handle_event()
and audit_mark_handle_event() guarantee that dname won't have been NULL.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-04-26 13:51:03 -04:00
Al Viro 25b229dff4 fsnotify(): switch to passing const struct qstr * for file_name
Note that in fnsotify_move() and fsnotify_link() we are guaranteed
that dentry->d_name won't change during the fsnotify() evaluation
(by having the parent directory locked exclusive), so we don't
need to fetch dentry->d_name.name in the callers.  In fsnotify_dirent()
the same stability of dentry->d_name is also true, but it's a bit
more convoluted - there is one callchain (devpts_pty_new() ->
fsnotify_create() -> fsnotify_dirent()) where the parent is _not_
locked, but on devpts ->d_name of everything is unchanging; it
has neither explicit nor implicit renames.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-04-26 13:37:25 -04:00
Al Viro 230c6402b1 ovl_lookup_real_one(): don't bother with strlen()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-04-26 13:13:33 -04:00
David Howells 5dd50aaeb1
Make anon_inodes unconditional
Make the anon_inodes facility unconditional so that it can be used by core
VFS code and pidfd code.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
[christian@brauner.io: adapt commit message to mention pidfds]
Signed-off-by: Christian Brauner <christian@brauner.io>
2019-04-19 14:03:11 +02:00
Jan Kara b2d22b6bb3 fanotify: Allow copying of file handle to userspace
When file handle is embedded inside fanotify_event and usercopy checks
are enabled, we get a warning like:

Bad or missing usercopy whitelist? Kernel memory exposure attempt detected
from SLAB object 'fanotify_event' (offset 40, size 8)!
WARNING: CPU: 1 PID: 7649 at mm/usercopy.c:78 usercopy_warn+0xeb/0x110
mm/usercopy.c:78

Annotate handling in fanotify_event properly to mark copying it to
userspace is fine.

Reported-by: syzbot+2c49971e251e36216d1f@syzkaller.appspotmail.com
Fixes: a8b13aa20a ("fanotify: enable FAN_REPORT_FID init flag")
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-03-19 09:29:07 +01:00
ZhangXiaoxu 62c9d2674b inotify: Fix fsnotify_mark refcount leak in inotify_update_existing_watch()
Commit 4d97f7d53d ("inotify: Add flag IN_MASK_CREATE for
inotify_add_watch()") forgot to call fsnotify_put_mark() with
IN_MASK_CREATE after fsnotify_find_mark()

Fixes: 4d97f7d53d ("inotify: Add flag IN_MASK_CREATE for inotify_add_watch()")
Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-03-11 10:13:17 +01:00
Jan Kara b519057981 fanotify: Make waits for fanotify events only killable
Making waits for response to fanotify permission events interruptible
can result in EINTR returns from open(2) or other syscalls when there's
e.g. AV software that's monitoring the file. Orion reports that e.g.
bash is complaining like:

bash: /etc/bash_completion.d/itweb-settings.bash: Interrupted system call

So for now convert the wait from interruptible to only killable one.
That is mostly invisible to userspace. Sadly this breaks hibernation
with fanotify permission events pending again but we have to put more
thought into how to fix this without regressing userspace visible
behavior.

Reported-by: Orion Poplawski <orion@nwra.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-02-21 11:47:23 +01:00
Jan Kara fabf7f29b3 fanotify: Use interruptible wait when waiting for permission events
When waiting for response to fanotify permission events, we currently
use uninterruptible waits. That makes code simple however it can cause
lots of processes to end up in uninterruptible sleep with hard reboot
being the only alternative in case fanotify listener process stops
responding (e.g. due to a bug in its implementation). Uninterruptible
sleep also makes system hibernation fail if the listener gets frozen
before the process generating fanotify permission event.

Fix these problems by using interruptible sleep for waiting for response
to fanotify event. This is slightly tricky though - we have to
detect when the event got already reported to userspace as in that
case we must not free the event. Instead we push the responsibility for
freeing the event to the process that will write response to the
event.

Reported-by: Orion Poplawski <orion@nwra.com>
Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-02-18 12:41:16 +01:00
Jan Kara 40873284d7 fanotify: Track permission event state
Track whether permission event got already reported to userspace and
whether userspace already answered to the permission event. Protect
stores to this field together with updates to ->response field by
group->notification_lock. This will allow aborting wait for reply to
permission event from userspace.

Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-02-18 12:41:16 +01:00
Jan Kara ca6f86998d fanotify: Simplify cleaning of access_list
Simplify iteration cleaning access_list in fanotify_release(). That will
make following changes more obvious.

Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-02-18 12:41:16 +01:00
Jan Kara f7db89accc fsnotify: Create function to remove event from notification list
Create function to remove event from the notification list. Later it will
be used from more places.

Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-02-18 12:41:16 +01:00
Jan Kara 8c5544666c fanotify: Move locking inside get_one_event()
get_one_event() has a single caller and that just locks
notification_lock around the call. Move locking inside get_one_event()
as that will make using ->response field for permission event state
easier.

Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-02-18 12:40:46 +01:00
Jan Kara af6a511306 fanotify: Fold dequeue_event() into process_access_response()
Fold dequeue_event() into process_access_response(). This will make
changes to use of ->response field easier.

Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-02-18 11:49:36 +01:00
Jan Kara 53136b393c fanotify: Select EXPORTFS
Fanotify now uses exportfs_encode_inode_fh() so it needs to select
EXPORTFS.

Fixes: e9e0c89030 "fanotify: encode file identifier for FAN_REPORT_FID"
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-02-14 17:46:33 +01:00
Amir Goldstein e7fce6d94c fanotify: report FAN_ONDIR to listener with FAN_REPORT_FID
dirent modification events (create/delete/move) do not carry the
child entry name/inode information. Instead, we report FAN_ONDIR
for mkdir/rmdir so user can differentiate them from creat/unlink.

This is consistent with inotify reporting IN_ISDIR with dirent events
and is useful for implementing recursive directory tree watcher.

We avoid merging dirent events referring to subdirs with dirent events
referring to non subdirs, otherwise, user won't be able to tell from a
mask FAN_CREATE|FAN_DELETE|FAN_ONDIR if it describes mkdir+unlink pair
or rmdir+create pair of events.

For backward compatibility and consistency, do not report FAN_ONDIR
to user in legacy fanotify mode (reporting fd) and report FAN_ONDIR
to user in FAN_REPORT_FID mode for all event types.

Cc: <linux-api@vger.kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-02-07 16:47:32 +01:00
Amir Goldstein 235328d1fa fanotify: add support for create/attrib/move/delete events
Add support for events with data type FSNOTIFY_EVENT_INODE
(e.g. create/attrib/move/delete) for inode and filesystem mark types.

The "inode" events do not carry enough information (i.e. path) to
report event->fd, so we do not allow setting a mask for those events
unless group supports reporting fid.

The "inode" events are not supported on a mount mark, because they do
not carry enough information (i.e. path) to be filtered by mount point.

The "dirent" events (create/move/delete) report the fid of the parent
directory where events took place without specifying the filename of the
child. In the future, fanotify may get support for reporting filename
information for those events.

Cc: <linux-api@vger.kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-02-07 16:43:23 +01:00
Amir Goldstein 83b535d289 fanotify: support events with data type FSNOTIFY_EVENT_INODE
When event data type is FSNOTIFY_EVENT_INODE, we don't have a refernece
to the mount, so we will not be able to open a file descriptor when user
reads the event. However, if the listener has enabled reporting file
identifier with the FAN_REPORT_FID init flag, we allow reporting those
events and we use an identifier inode to encode fid.

The inode to use as identifier when reporting fid depends on the event.
For dirent modification events, we report the modified directory inode
and we report the "victim" inode otherwise.
For example:
FS_ATTRIB reports the child inode even if reported on a watched parent.
FS_CREATE reports the modified dir inode and not the created inode.

[JK: Fixup condition in fanotify_group_event_mask()]

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-02-07 16:38:36 +01:00
Amir Goldstein 0321e03cb4 fanotify: check FS_ISDIR flag instead of d_is_dir()
All fsnotify hooks set the FS_ISDIR flag for events that happen
on directory victim inodes except for fsnotify_perm().

Add the missing FS_ISDIR flag in fsnotify_perm() hook and let
fanotify_group_event_mask() check the FS_ISDIR flag instead of
checking if path argument is a directory.

This is needed for fanotify support for event types that do not
carry path information.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-02-07 16:38:36 +01:00
Amir Goldstein 0a20df7ed3 fsnotify: report FS_ISDIR flag with MOVE_SELF and DELETE_SELF events
We need to report FS_ISDIR flag with MOVE_SELF and DELETE_SELF events
for fanotify, because fanotify API requires the user to explicitly
request events on directories by FAN_ONDIR flag.

inotify never reported IN_ISDIR with those events. It looks like an
oversight, but to avoid the risk of breaking existing inotify programs,
mask the FS_ISDIR flag out when reprting those events to inotify backend.

We also add the FS_ISDIR flag with FS_ATTRIB event in the case of rename
over an empty target directory. inotify did not report IN_ISDIR in this
case, but it normally does report IN_ISDIR along with IN_ATTRIB event,
so in this case, we do not mask out the FS_ISDIR flag.

[JK: Simplify the checks in fsnotify_move()]

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-02-07 16:38:35 +01:00
Amir Goldstein 73072283a2 fanotify: use vfs_get_fsid() helper instead of vfs_statfs()
This is a cleanup that doesn't change any logic.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-02-07 16:38:35 +01:00
Amir Goldstein 77115225ac fanotify: cache fsid in fsnotify_mark_connector
For FAN_REPORT_FID, we need to encode fid with fsid of the filesystem on
every event. To avoid having to call vfs_statfs() on every event to get
fsid, we store the fsid in fsnotify_mark_connector on the first time we
add a mark and on handle event we use the cached fsid.

Subsequent calls to add mark on the same object are expected to pass the
same fsid, so the call will fail on cached fsid mismatch.

If an event is reported on several mark types (inode, mount, filesystem),
all connectors should already have the same fsid, so we use the cached
fsid from the first connector.

[JK: Simplify code flow around fanotify_get_fid()
     make fsid argument of fsnotify_add_mark_locked() unconditional]

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-02-07 16:38:35 +01:00
Amir Goldstein a8b13aa20a fanotify: enable FAN_REPORT_FID init flag
When setting up an fanotify listener, user may request to get fid
information in event instead of an open file descriptor.

The fid obtained with event on a watched object contains the file
handle returned by name_to_handle_at(2) and fsid returned by statfs(2).

Restrict FAN_REPORT_FID to class FAN_CLASS_NOTIF, because we have have
no good reason to support reporting fid on permission events.

When setting a mark, we need to make sure that the filesystem
supports encoding file handles with name_to_handle_at(2) and that
statfs(2) encodes a non-zero fsid.

Cc: <linux-api@vger.kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-02-07 16:38:35 +01:00
Amir Goldstein 5e469c830f fanotify: copy event fid info to user
If group requested FAN_REPORT_FID and event has file identifier,
copy that information to user reading the event after event metadata.

fid information is formatted as struct fanotify_event_info_fid
that includes a generic header struct fanotify_event_info_header,
so that other info types could be defined in the future using the
same header.

metadata->event_len includes the length of the fid information.

The fid information includes the filesystem's fsid (see statfs(2))
followed by an NFS file handle of the file that could be passed as
an argument to open_by_handle_at(2).

Cc: <linux-api@vger.kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-02-07 16:38:35 +01:00
Amir Goldstein e9e0c89030 fanotify: encode file identifier for FAN_REPORT_FID
When user requests the flag FAN_REPORT_FID in fanotify_init(),
a unique file identifier of the event target object will be reported
with the event.

The file identifier includes the filesystem's fsid (i.e. from statfs(2))
and an NFS file handle of the file (i.e. from name_to_handle_at(2)).

The file identifier makes holding the path reference and passing a file
descriptor to user redundant, so those are disabled in a group with
FAN_REPORT_FID.

Encode fid and store it in event for a group with FAN_REPORT_FID.
Up to 12 bytes of file handle on 32bit arch (16 bytes on 64bit arch)
are stored inline in fanotify_event struct. Larger file handles are
stored in an external allocated buffer.

On failure to encode fid, we print a warning and queue the event
without the fid information.

[JK: Fold part of later patched into this one to use
exportfs_encode_inode_fh() right away]

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-02-07 16:38:34 +01:00
Amir Goldstein bb2f7b4542 fanotify: open code fill_event_metadata()
The helper is quite trivial and open coding it will make it easier
to implement copying event fid info to user.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-02-07 16:37:01 +01:00
Amir Goldstein 33913997d5 fanotify: rename struct fanotify_{,perm_}event_info
struct fanotify_event_info "inherits" from struct fsnotify_event and
therefore a more appropriate (and short) name for it is fanotify_event.
Same for struct fanotify_perm_event_info, which now "inherits" from
struct fanotify_event.

We plan to reuse the name struct fanotify_event_info for user visible
event info record format.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-02-06 15:25:45 +01:00
Amir Goldstein a0a92d261f fsnotify: move mask out of struct fsnotify_event
Common fsnotify_event helpers have no need for the mask field.
It is only used by backend code, so move the field out of the
abstract fsnotify_event struct and into the concrete backend
event structs.

This change packs struct inotify_event_info better on 64bit
machine and will allow us to cram some more fields into
struct fanotify_event_info.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-02-06 15:25:11 +01:00
Amir Goldstein 45a9fb3725 fsnotify: send all event types to super block marks
So far, existence of super block marks was checked only on events with
data type FSNOTIFY_EVENT_PATH. Use the super block of the "to_tell" inode
to report the events of all event types to super block marks.

This change has no effect on current backends. Soon, this will allow
fanotify backend to receive all event types on a super block mark.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-02-06 15:20:30 +01:00
Tetsuo Handa 125892edfe inotify: Fix fd refcount leak in inotify_add_watch().
Commit 4d97f7d53d ("inotify: Add flag IN_MASK_CREATE for
inotify_add_watch()") forgot to call fdput() before bailing out.

Fixes: 4d97f7d53d ("inotify: Add flag IN_MASK_CREATE for inotify_add_watch()")
CC: stable@vger.kernel.org
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-01-02 18:28:37 +01:00
Nikolay Borisov ac9498d686 fanotify: Use inode_is_open_for_write
Use the aptly named function rather than opencoding i_writecount check.
No functional changes.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-12-11 10:55:45 +01:00
Kees Cook 5b03a472b4 fanotify: Make sure to check event_len when copying
As a precaution, make sure we check event_len when copying to userspace.
Based on old feedback: https://lkml.kernel.org/r/542D9FE5.3010009@gmx.de

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-12-05 12:47:22 +01:00
Eric Biggers d6f7aa9820 fsnotify/fdinfo: include fdinfo.h for inotify_show_fdinfo()
inotify_show_fdinfo() is defined in fs/notify/fdinfo.c and declared in
fs/notify/fdinfo.h, but the declaration isn't included at the point of
the definition.  Include the header to enforce that the definition
matches the declaration.

This addresses a gcc warning when -Wmissing-prototypes is enabled.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-11-15 17:34:27 +01:00
Matthew Bobrowski 66917a3130 fanotify: introduce new event mask FAN_OPEN_EXEC_PERM
A new event mask FAN_OPEN_EXEC_PERM has been defined. This allows users
to receive events and grant access to files that are intending to be
opened for execution. Events of FAN_OPEN_EXEC_PERM type will be
generated when a file has been opened by using either execve(),
execveat() or uselib() system calls.

This acts in the same manner as previous permission event mask, meaning
that an access response is required from the user application in order
to permit any further operations on the file.

Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-11-13 18:41:05 +01:00
Matthew Bobrowski 9b076f1c0f fanotify: introduce new event mask FAN_OPEN_EXEC
A new event mask FAN_OPEN_EXEC has been defined so that users have the
ability to receive events specifically when a file has been opened with
the intent to be executed. Events of FAN_OPEN_EXEC type will be
generated when a file has been opened using either execve(), execveat()
or uselib() system calls.

The feature is implemented within fsnotify_open() by generating the
FAN_OPEN_EXEC event type if __FMODE_EXEC is set within file->f_flags.

Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-11-13 18:41:04 +01:00
Matthew Bobrowski 2d10b23082 fanotify: return only user requested event types in event mask
Modify fanotify_should_send_event() so that it now returns a mask for
an event that contains ONLY flags for the event types that have been
specifically requested by the user. Flags that may have been included
within the event mask, but have not been explicitly requested by the
user will not be present in the returned value.

As an example, given the situation where a user requests events of type
FAN_OPEN. Traditionally, the event mask returned within an event that
occurred on a filesystem object that has been marked for monitoring and is
opened, will only ever have the FAN_OPEN bit set. With the introduction of
the new flags like FAN_OPEN_EXEC, and perhaps any other future event
flags, there is a possibility of the returned event mask containing more
than a single bit set, despite having only requested the single event type.
Prior to these modifications performed to fanotify_should_send_event(), a
user would have received a bundled event mask containing flags FAN_OPEN
and FAN_OPEN_EXEC in the instance that a file was opened for execution via
execve(), for example. This means that a user would receive event types
in the returned event mask that have not been requested. This runs the
possibility of breaking existing systems and causing other unforeseen
issues.

To mitigate this possibility, fanotify_should_send_event() has been
modified to return the event mask containing ONLY event types explicitly
requested by the user. This means that we will NOT report events that the
user did no set a mask for, and we will NOT report events that the user
has set an ignore mask for.

The function name fanotify_should_send_event() has also been updated so
that it's more relevant to what it has been designed to do.

Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-11-13 18:38:18 +01:00
Amir Goldstein b469e7e47c fanotify: fix handling of events on child sub-directory
When an event is reported on a sub-directory and the parent inode has
a mark mask with FS_EVENT_ON_CHILD|FS_ISDIR, the event will be sent to
fsnotify() even if the event type is not in the parent mark mask
(e.g. FS_OPEN).

Further more, if that event happened on a mount or a filesystem with
a mount/sb mark that does have that event type in their mask, the "on
child" event will be reported on the mount/sb mark.  That is not
desired, because user will get a duplicate event for the same action.

Note that the event reported on the victim inode is never merged with
the event reported on the parent inode, because of the check in
should_merge(): old_fsn->inode == new_fsn->inode.

Fix this by looking for a match of an actual event type (i.e. not just
FS_ISDIR) in parent's inode mark mask and by not reporting an "on child"
event to group if event type is only found on mount/sb marks.

[backport hint: The bug seems to have always been in fanotify, but this
                patch will only apply cleanly to v4.19.y]

Cc: <stable@vger.kernel.org> # v4.19
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-11-08 15:43:48 +01:00
Jan Kara 721fb6fbfd fsnotify: Fix busy inodes during unmount
Detaching of mark connector from fsnotify_put_mark() can race with
unmounting of the filesystem like:

  CPU1				CPU2
fsnotify_put_mark()
  spin_lock(&conn->lock);
  ...
  inode = fsnotify_detach_connector_from_object(conn)
  spin_unlock(&conn->lock);
				generic_shutdown_super()
				  fsnotify_unmount_inodes()
				    sees connector detached for inode
				      -> nothing to do
				  evict_inode()
				    barfs on pending inode reference
  iput(inode);

Resulting in "Busy inodes after unmount" message and possible kernel
oops. Make fsnotify_unmount_inodes() properly wait for outstanding inode
references from detached connectors.

Note that the accounting of outstanding inode references in the
superblock can cause some cacheline contention on the counter. OTOH it
happens only during deletion of the last notification mark from an inode
(or during unlinking of watched inode) and that is not too bad. I have
measured time to create & delete inotify watch 100000 times from 64
processes in parallel (each process having its own inotify group and its
own file on a shared superblock) on a 64 CPU machine. Average and
standard deviation of 15 runs look like:

	Avg		Stddev
Vanilla	9.817400	0.276165
Fixed	9.710467	0.228294

So there's no statistically significant difference.

Fixes: 6b3f05d24d ("fsnotify: Detach mark from object list when last reference is dropped")
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
2018-10-25 15:49:19 +02:00
Amir Goldstein d0a6a87e40 fanotify: support reporting thread id instead of process id
In order to identify which thread triggered the event in a
multi-threaded program, add the FAN_REPORT_TID flag in fanotify_init to
opt-in for reporting the event creator's thread id information.

Signed-off-by: nixiaoming <nixiaoming@huawei.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-10-08 13:48:45 +02:00
Amir Goldstein bdd5a46fe3 fanotify: add BUILD_BUG_ON() to count the bits of fanotify constants
Also define the FANOTIFY_EVENT_FLAGS consisting of the extra flags
FAN_ONDIR and FAN_ON_CHILD.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-10-04 13:30:03 +02:00
Amir Goldstein a39f7ec417 fsnotify: convert runtime BUG_ON() to BUILD_BUG_ON()
The BUG_ON() statements to verify number of bits in ALL_FSNOTIFY_BITS
and ALL_INOTIFY_BITS are converted to build time check of the constant.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-10-04 13:28:50 +02:00
Amir Goldstein 23c9deeb32 fanotify: deprecate uapi FAN_ALL_* constants
We do not want to add new bits to the FAN_ALL_* uapi constants
because they have been exposed to userspace.  If there are programs
out there using these constants, those programs could break if
re-compiled with modified FAN_ALL_* constants and run on an old kernel.

We deprecate the uapi constants FAN_ALL_* and define new FANOTIFY_*
constants for internal use to replace them. New feature bits will be
added only to the new constants.

Cc: <linux-api@vger.kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-10-04 13:28:38 +02:00
Amir Goldstein a72fd224e3 fanotify: simplify handling of FAN_ONDIR
fanotify mark add/remove code jumps through hoops to avoid setting the
FS_ISDIR in the commulative object mask.

That was just papering over a bug in fsnotify() handling of the FS_ISDIR
extra flag. This bug is now fixed, so all the hoops can be removed along
with the unneeded internal flag FAN_MARK_ONDIR.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-10-04 13:28:29 +02:00
Amir Goldstein 007d1e8395 fsnotify: generalize handling of extra event flags
FS_EVENT_ON_CHILD gets a special treatment in fsnotify() because it is
not a flag specifying an event type, but rather an extra flags that may
be reported along with another event and control the handling of the
event by the backend.

FS_ISDIR is also an "extra flag" and not an "event type" and therefore
desrves the same treatment. With inotify/dnotify backends it was never
possible to set FS_ISDIR in mark masks, so it did not matter.
With fanotify backend, mark adding code jumps through hoops to avoid
setting the FS_ISDIR in the commulative object mask.

Separate the constant ALL_FSNOTIFY_EVENTS to ALL_FSNOTIFY_FLAGS and
ALL_FSNOTIFY_EVENTS, so the latter can be used to test for specific
event types.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-10-04 13:28:02 +02:00
Amir Goldstein 96a71f21ef fanotify: store fanotify_init() flags in group's fanotify_data
This averts the need to re-generate flags in fanotify_show_fdinfo()
and sets the scene for addition of more upcoming flags without growing
new members to the fanotify_data struct.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-09-27 15:29:00 +02:00
Amir Goldstein d54f4fba88 fanotify: add API to attach/detach super block mark
Add another mark type flag FAN_MARK_FILESYSTEM for add/remove/flush
of super block mark type.

A super block watch gets all events on the filesystem, regardless of
the mount from which the mark was added, unless an ignore mask exists
on either the inode or the mount where the event was generated.

Only one of FAN_MARK_MOUNT and FAN_MARK_FILESYSTEM mark type flags
may be provided to fanotify_mark() or no mark type flag for inode mark.

Cc: <linux-api@vger.kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-09-03 15:16:11 +02:00
Amir Goldstein 60f7ed8c7c fsnotify: send path type events to group with super block marks
Send events to group if super block mark mask matches the event
and unless the same group has an ignore mask on the vfsmount or
the inode on which the event occurred.

Soon, fanotify backend is going to support super block marks and
fanotify backend only supports path type events.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-09-03 15:14:04 +02:00
Amir Goldstein 1e6cb72399 fsnotify: add super block object type
Add the infrastructure to attach a mark to a super_block struct
and detach all attached marks when super block is destroyed.

This is going to be used by fanotify backend to setup super block
marks.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-09-03 15:14:01 +02:00
Amir Goldstein 9bdda4e9cf fsnotify: fix ignore mask logic in fsnotify()
Commit 92183a4289 ("fsnotify: fix ignore mask logic in
send_to_group()") acknoledges the use case of ignoring an event on
an inode mark, because of an ignore mask on a mount mark of the same
group (i.e. I want to get all events on this file, except for the events
that came from that mount).

This change depends on correctly merging the inode marks and mount marks
group lists, so that the mount mark ignore mask would be tested in
send_to_group(). Alas, the merging of the lists did not take into
account the case where event in question is not in the mask of any of
the mount marks.

To fix this, completely remove the tests for inode and mount event masks
from the lists merging code.

Fixes: 92183a4289 ("fsnotify: fix ignore mask logic in send_to_group")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-09-03 14:57:41 +02:00
Linus Torvalds f3f106dac0 \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAluGfbAACgkQnJ2qBz9k
 QNmkVQgAjv/tCHStwoQ4Hhr6q5wLU9apAW5WC7Or2MyNfqoJsKgyujpikwFMa0xY
 wkVqlITYavvUKl/nRlQ6hpZtAY7hi1Y7GvIYs2ci6QO4YAUuBoH7qPLKqyZYzXx+
 mBaC68885nMMDqIHvsCcLurwTiDer6XQXXPDmpMoO9g4kyVuhm2e0/M7CPaA4Ue9
 WEfBPBROSNdRH7Wtww/MUvtfd+9mezdlpUQZVVO5cdhAVQ8pzW2g2piYtwIpaY7E
 DiJ1zcgaqCnni+b69x4wz8TwB0q8wZJjrPfJgf4X7mIGBZrw80yNRlevfe5SxwdJ
 mkMDvjaD0TEItvpjgTZReXaAjFOWTA==
 =E61M
 -----END PGP SIGNATURE-----

Merge tag 'for_v4.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull misc fs fixes from Jan Kara:

 - make UDF to properly mount media created by Win7

 - make isofs to properly refuse devices with large physical block size

 - fix a Spectre gadget in quotactl(2)

 - fix a warning in fsnotify code hit by syzkaller

* tag 'for_v4.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  udf: Fix mounting of Win7 created UDF filesystems
  udf: Remove dead code from udf_find_fileset()
  fs/quota: Fix spectre gadget in do_quotactl
  fs/quota: Replace XQM_MAXQUOTAS usage with MAXQUOTAS
  isofs: reject hardware sector size > 2048 bytes
  fsnotify: fix false positive warning on inode delete
2018-08-29 14:56:45 -07:00
Linus Torvalds 0214f46b3a Merge branch 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull core signal handling updates from Eric Biederman:
 "It was observed that a periodic timer in combination with a
  sufficiently expensive fork could prevent fork from every completing.
  This contains the changes to remove the need for that restart.

  This set of changes is split into several parts:

   - The first part makes PIDTYPE_TGID a proper pid type instead
     something only for very special cases. The part starts using
     PIDTYPE_TGID enough so that in __send_signal where signals are
     actually delivered we know if the signal is being sent to a a group
     of processes or just a single process.

   - With that prep work out of the way the logic in fork is modified so
     that fork logically makes signals received while it is running
     appear to be received after the fork completes"

* 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (22 commits)
  signal: Don't send signals to tasks that don't exist
  signal: Don't restart fork when signals come in.
  fork: Have new threads join on-going signal group stops
  fork: Skip setting TIF_SIGPENDING in ptrace_init_task
  signal: Add calculate_sigpending()
  fork: Unconditionally exit if a fatal signal is pending
  fork: Move and describe why the code examines PIDNS_ADDING
  signal: Push pid type down into complete_signal.
  signal: Push pid type down into __send_signal
  signal: Push pid type down into send_signal
  signal: Pass pid type into do_send_sig_info
  signal: Pass pid type into send_sigio_to_task & send_sigurg_to_task
  signal: Pass pid type into group_send_sig_info
  signal: Pass pid and pid type into send_sigqueue
  posix-timers: Noralize good_sigevent
  signal: Use PIDTYPE_TGID to clearly store where file signals will be sent
  pid: Implement PIDTYPE_TGID
  pids: Move the pgrp and session pid pointers from task_struct to signal_struct
  kvm: Don't open code task_pid in kvm_vcpu_ioctl
  pids: Compute task_tgid using signal->leader_pid
  ...
2018-08-21 13:47:29 -07:00
Jan Kara d3bc0fa841 fsnotify: fix false positive warning on inode delete
When inode is getting deleted and someone else holds reference to a mark
attached to the inode, we just detach the connector from the inode. In
that case fsnotify_put_mark() called from fsnotify_destroy_marks() will
decide to recalculate mask for the inode and __fsnotify_recalc_mask()
will WARN about invalid connector type:

WARNING: CPU: 1 PID: 12015 at fs/notify/mark.c:139
__fsnotify_recalc_mask+0x2d7/0x350 fs/notify/mark.c:139

Actually there's no reason to warn about detached connector in
__fsnotify_recalc_mask() so just silently skip updating the mask in such
case.

Reported-by: syzbot+c34692a51b9a6ca93540@syzkaller.appspotmail.com
Fixes: 3ac70bfcde ("fsnotify: add helper to get mask from connector")
Signed-off-by: Jan Kara <jack@suse.cz>
2018-08-20 13:55:45 +02:00
Linus Torvalds 6ada4e2826 Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:

 - a few misc things

 - a few Y2038 fixes

 - ntfs fixes

 - arch/sh tweaks

 - ocfs2 updates

 - most of MM

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (111 commits)
  mm/hmm.c: remove unused variables align_start and align_end
  fs/userfaultfd.c: remove redundant pointer uwq
  mm, vmacache: hash addresses based on pmd
  mm/list_lru: introduce list_lru_shrink_walk_irq()
  mm/list_lru.c: pass struct list_lru_node* as an argument to __list_lru_walk_one()
  mm/list_lru.c: move locking from __list_lru_walk_one() to its caller
  mm/list_lru.c: use list_lru_walk_one() in list_lru_walk_node()
  mm, swap: make CONFIG_THP_SWAP depend on CONFIG_SWAP
  mm/sparse: delete old sparse_init and enable new one
  mm/sparse: add new sparse_init_nid() and sparse_init()
  mm/sparse: move buffer init/fini to the common place
  mm/sparse: use the new sparse buffer functions in non-vmemmap
  mm/sparse: abstract sparse buffer allocations
  mm/hugetlb.c: don't zero 1GiB bootmem pages
  mm, page_alloc: double zone's batchsize
  mm/oom_kill.c: document oom_lock
  mm/hugetlb: remove gigantic page support for HIGHMEM
  mm, oom: remove sleep from under oom_lock
  kernel/dma: remove unsupported gfp_mask parameter from dma_alloc_from_contiguous()
  mm/cma: remove unsupported gfp_mask parameter from cma_alloc()
  ...
2018-08-17 16:49:31 -07:00
Shakeel Butt d46eb14b73 fs: fsnotify: account fsnotify metadata to kmemcg
Patch series "Directed kmem charging", v8.

The Linux kernel's memory cgroup allows limiting the memory usage of the
jobs running on the system to provide isolation between the jobs.  All
the kernel memory allocated in the context of the job and marked with
__GFP_ACCOUNT will also be included in the memory usage and be limited
by the job's limit.

The kernel memory can only be charged to the memcg of the process in
whose context kernel memory was allocated.  However there are cases
where the allocated kernel memory should be charged to the memcg
different from the current processes's memcg.  This patch series
contains two such concrete use-cases i.e.  fsnotify and buffer_head.

The fsnotify event objects can consume a lot of system memory for large
or unlimited queues if there is either no or slow listener.  The events
are allocated in the context of the event producer.  However they should
be charged to the event consumer.  Similarly the buffer_head objects can
be allocated in a memcg different from the memcg of the page for which
buffer_head objects are being allocated.

To solve this issue, this patch series introduces mechanism to charge
kernel memory to a given memcg.  In case of fsnotify events, the memcg
of the consumer can be used for charging and for buffer_head, the memcg
of the page can be charged.  For directed charging, the caller can use
the scope API memalloc_[un]use_memcg() to specify the memcg to charge
for all the __GFP_ACCOUNT allocations within the scope.

This patch (of 2):

A lot of memory can be consumed by the events generated for the huge or
unlimited queues if there is either no or slow listener.  This can cause
system level memory pressure or OOMs.  So, it's better to account the
fsnotify kmem caches to the memcg of the listener.

However the listener can be in a different memcg than the memcg of the
producer and these allocations happen in the context of the event
producer.  This patch introduces remote memcg charging API which the
producer can use to charge the allocations to the memcg of the listener.

There are seven fsnotify kmem caches and among them allocations from
dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and
inotify_inode_mark_cachep happens in the context of syscall from the
listener.  So, SLAB_ACCOUNT is enough for these caches.

The objects from fsnotify_mark_connector_cachep are not accounted as
they are small compared to the notification mark or events and it is
unclear whom to account connector to since it is shared by all events
attached to the inode.

The allocations from the event caches happen in the context of the event
producer.  For such caches we will need to remote charge the allocations
to the listener's memcg.  Thus we save the memcg reference in the
fsnotify_group structure of the listener.

This patch has also moved the members of fsnotify_group to keep the size
same, at least for 64 bit build, even with additional member by filling
the holes.

[shakeelb@google.com: use GFP_KERNEL_ACCOUNT rather than open-coding it]
  Link: http://lkml.kernel.org/r/20180702215439.211597-1-shakeelb@google.com
Link: http://lkml.kernel.org/r/20180627191250.209150-2-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:30 -07:00
Eric W. Biederman 019191342f signal: Use PIDTYPE_TGID to clearly store where file signals will be sent
When f_setown is called a pid and a pid type are stored.  Replace the use
of PIDTYPE_PID with PIDTYPE_TGID as PIDTYPE_TGID goes to the entire thread
group.  Replace the use of PIDTYPE_MAX with PIDTYPE_PID as PIDTYPE_PID now
is only for a thread.

Update the users of __f_setown to use PIDTYPE_TGID instead of
PIDTYPE_PID.

For now the code continues to capture task_pid (when task_tgid would
really be appropriate), and iterate on PIDTYPE_PID (even when type ==
PIDTYPE_TGID) out of an abundance of caution to preserve existing
behavior.

Oleg Nesterov suggested using the test to ensure we use PIDTYPE_PID
for tgid lookup also be used to avoid taking the tasklist lock.

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-07-21 10:43:12 -05:00
Eric W. Biederman 7a36094d61 pids: Compute task_tgid using signal->leader_pid
The cost is the the same and this removes the need
to worry about complications that come from de_thread
and group_leader changing.

__task_pid_nr_ns has been updated to take advantage of this change.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-07-21 10:43:12 -05:00
Henry Wilson 4d97f7d53d inotify: Add flag IN_MASK_CREATE for inotify_add_watch()
The flag IN_MASK_CREATE is introduced as a flag for inotiy_add_watch()
which prevents inotify from modifying any existing watches when invoked.
If the pathname specified in the call has a watched inode associated
with it and IN_MASK_CREATE is specified, fail with an errno of EEXIST.

Use of IN_MASK_CREATE with IN_MASK_ADD is reserved for future use and
will return EINVAL.

RATIONALE

In the current implementation, there is no way to prevent
inotify_add_watch() from modifying existing watch descriptors. Even if
the caller keeps a record of all watch descriptors collected, this is
only sufficient to detect that an existing watch descriptor may have
been modified.

The assumption that a particular path will map to the same inode over
multiple calls to inotify_add_watch() cannot be made as files can be
renamed or deleted.  It is also not possible to assume that two distinct
paths do no map to the same inode, due to hard-links or a dereferenced
symbolic link. Further uses of inotify_add_watch() to revert the change
may cause other watch descriptors to be modified or created, merely
compunding the problem. There is currently no system call such as
inotify_modify_watch() to explicity modify a watch descriptor, which
would be able to revert unwanted changes. Thus the caller cannot
guarantee to be able to revert any changes to existing watch decriptors.

Additionally the caller cannot assume that the events that are
associated with a watch descriptor are within the set requested, as any
future calls to inotify_add_watch() may unintentionally modify a watch
descriptor's mask. Thus it cannot currently be guaranteed that a watch
descriptor will only generate events which have been requested. The
program must filter events which come through its watch descriptor to
within its expected range.

Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Henry Wilson <henry.wilson@acentic.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-06-27 19:21:25 +02:00
Amir Goldstein eaa2c6b0c9 fanotify: factor out helpers to add/remove mark
Factor out helpers fanotify_add_mark() and fanotify_remove_mark()
to reduce duplicated code.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-06-27 13:45:12 +02:00
Amir Goldstein 3ac70bfcde fsnotify: add helper to get mask from connector
Use a helper to get the mask from the object (i.e. i_fsnotify_mask)
to generalize code of add/remove inode/vfsmount mark.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-06-27 13:45:07 +02:00
Amir Goldstein 36f10f55ff fsnotify: let connector point to an abstract object
Make the code to attach/detach a connector to object more generic
by letting the fsnotify connector point to an abstract fsnotify_connp_t.
Code that needs to dereference an inode or mount object now uses the
helpers fsnotify_conn_{inode,mount}.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-06-27 13:45:05 +02:00
Amir Goldstein b812a9f589 fsnotify: pass connp and object type to fsnotify_add_mark()
Instead of passing inode and vfsmount arguments to fsnotify_add_mark()
and its _locked variant, pass an abstract object pointer and the object
type.

The helpers fsnotify_obj_{inode,mount} are added to get the concrete
object pointer from abstract object pointer.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-06-27 13:45:03 +02:00
Amir Goldstein 9b6e543450 fsnotify: use typedef fsnotify_connp_t for brevity
The object marks manipulation functions fsnotify_destroy_marks()
fsnotify_find_mark() and their helpers take an argument of type
struct fsnotify_mark_connector __rcu ** to dereference the connector
pointer. use a typedef to describe this type for brevity.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-06-27 13:44:59 +02:00
Amir Goldstein b249f5be61 fsnotify: add fsnotify_add_inode_mark() wrappers
Before changing the arguments of the functions fsnotify_add_mark()
and fsnotify_add_mark_locked(), convert most callers to use a wrapper.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-05-18 14:58:22 +02:00
Amir Goldstein 837a393438 fanotify: generalize fanotify_should_send_event()
Use fsnotify_foreach_obj_type macros to generalize the code that filters
events by marks mask and ignored_mask.

This is going to be used for adding mark of super block object type.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-05-18 14:58:22 +02:00
Amir Goldstein 3dca1a7494 fsnotify: generalize send_to_group()
Use fsnotify_foreach_obj_type macros to generalize the code that filters
events by marks mask and ignored_mask.

This is going to be used for adding mark of super block object type.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-05-18 14:58:22 +02:00
Amir Goldstein 47d9c7cc45 fsnotify: generalize iteration of marks by object type
Make some code that handles marks of object types inode and vfsmount
generic, so it can handle other object types.

Introduce fsnotify_foreach_obj_type macro to iterate marks by object type
and fsnotify_iter_{should|set}_report_type macros to set/test report_mask.

This is going to be used for adding mark of another object type
(super block mark).

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-05-18 14:58:22 +02:00
Amir Goldstein d9a6f30bb8 fsnotify: introduce marks iteration helpers
Introduce helpers fsnotify_iter_select_report_types() and
fsnotify_iter_next() to abstract the inode/vfsmount marks merged
list iteration.

This is a preparation patch before generalizing mark list
iteration to more mark object types (i.e. super block marks).

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-05-18 14:58:22 +02:00
Amir Goldstein 5b0457ad02 fsnotify: remove redundant arguments to handle_event()
inode_mark and vfsmount_mark arguments are passed to handle_event()
operation as function arguments as well as on iter_info struct.
The difference is that iter_info struct may contain marks that should
not be handled and are represented as NULL arguments to inode_mark or
vfsmount_mark.

Instead of passing the inode_mark and vfsmount_mark arguments, add
a report_mask member to iter_info struct to indicate which marks should
be handled, versus marks that should only be kept alive during user
wait.

This change is going to be used for passing more mark types
with handle_event() (i.e. super block marks).

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-05-18 14:58:22 +02:00
Amir Goldstein d6f7b98bc8 fsnotify: use type id to identify connector object type
An fsnotify_mark_connector is referencing a single type of object
(either inode or vfsmount). Instead of storing a type mask in
connector->flags, store a single type id in connector->type to
identify the type of object.

When a connector object is detached from the object, its type is set
to FSNOTIFY_OBJ_TYPE_DETACHED and this object is not going to be
reused.

The function fsnotify_clear_marks_by_group() is the only place where
type mask was used, so use type flags instead of type id to this
function.

This change is going to be more convenient when adding a new object
type (super block).

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-05-18 14:58:22 +02:00
Amir Goldstein 92183a4289 fsnotify: fix ignore mask logic in send_to_group()
The ignore mask logic in send_to_group() does not match the logic
in fanotify_should_send_event(). In the latter, a vfsmount mark ignore
mask precedes an inode mark mask and in the former, it does not.

That difference may cause events to be sent to fanotify backend for no
reason. Fix the logic in send_to_group() to match that of
fanotify_should_send_event().

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-04-13 15:52:49 +02:00
Amir Goldstein 54a307ba8d fanotify: fix logic of events on child
When event on child inodes are sent to the parent inode mark and
parent inode mark was not marked with FAN_EVENT_ON_CHILD, the event
will not be delivered to the listener process. However, if the same
process also has a mount mark, the event to the parent inode will be
delivered regadless of the mount mark mask.

This behavior is incorrect in the case where the mount mark mask does
not contain the specific event type. For example, the process adds
a mark on a directory with mask FAN_MODIFY (without FAN_EVENT_ON_CHILD)
and a mount mark with mask FAN_CLOSE_NOWRITE (without FAN_ONDIR).

A modify event on a file inside that directory (and inside that mount)
should not create a FAN_MODIFY event, because neither of the marks
requested to get that event on the file.

Fixes: 1968f5eed5 ("fanotify: use both marks when possible")
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-04-09 16:30:14 +02:00
Linus Torvalds be88751f32 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull misc filesystem updates from Jan Kara:
 "udf, ext2, quota, fsnotify fixes & cleanups:

   - udf fixes for handling of media without uid/gid

   - udf fixes for some corner cases in parsing of volume recognition
     sequence

   - improvements of fsnotify handling of ENOMEM

   - new ioctl to allow setting of watch descriptor id for inotify (for
     checkpoint - restart)

   - small ext2, reiserfs, quota cleanups"

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  quota: Kill an unused extern entry form quota.h
  reiserfs: Remove VLA from fs/reiserfs/reiserfs.h
  udf: fix potential refcnt problem of nls module
  ext2: change return code to -ENOMEM when failing memory allocation
  udf: Do not mark possibly inconsistent filesystems as closed
  fsnotify: Let userspace know about lost events due to ENOMEM
  fanotify: Avoid lost events due to ENOMEM for unlimited queues
  udf: Remove never implemented mount options
  udf: Update mount option documentation
  udf: Provide saner default for invalid uid / gid
  udf: Clean up handling of invalid uid/gid
  udf: Apply uid/gid mount options also to new inodes & chown
  udf: Ignore [ug]id=ignore mount options
  udf: Fix handling of Partition Descriptors
  udf: Unify common handling of descriptors
  udf: Convert descriptor index definitions to enum
  udf: Allow volume descriptor sequence to be terminated by unrecorded block
  udf: Simplify handling of Volume Descriptor Pointers
  udf: Fix off-by-one in volume descriptor sequence length
  inotify: Extend ioctl to allow to request id of new watch descriptor
2018-04-05 19:17:50 -07:00
Dominik Brodowski 183caa3c86 fanotify: add do_fanotify_mark() helper; remove in-kernel call to syscall
Using the fs-internal do_fanotify_mark() helper allows us to get rid of
the fs-internal call to the sys_fanotify_mark() syscall.

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Acked-by: Jan Kara <jack@suse.cz>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02 20:15:45 +02:00
Dominik Brodowski d0d89d1ed3 inotify: add do_inotify_init() helper; remove in-kernel call to syscall
Using the inotify-internal do_inotify_init() helper allows us to get rid
of the in-kernel call to sys_inotify_init1() syscall.

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Acked-by: Jan Kara <jack@suse.cz>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02 20:15:45 +02:00
Jan Kara 7b1f641776 fsnotify: Let userspace know about lost events due to ENOMEM
Currently if notification event is lost due to event allocation failing
we ENOMEM, we just silently continue (except for fanotify permission
events where we deny the access). This is undesirable as userspace has
no way of knowing whether the notifications it got are complete or not.
Treat lost events due to ENOMEM the same way as lost events due to queue
overflow so that userspace knows something bad happened and it likely
needs to rescan the filesystem.

Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-02-27 10:25:33 +01:00
Jan Kara 1f5eaa9001 fanotify: Avoid lost events due to ENOMEM for unlimited queues
Fanotify queues of unlimited length do not expect events can be lost.
Since these queues are used for system auditing and other security
related tasks, loosing events can even have security implications.
Currently, since the allocation is small (32-bytes), it cannot fail
however when we start accounting events in memcgs, allocation can start
failing. So avoid loosing events due to failure to allocate memory by
making event allocation use __GFP_NOFAIL.

Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-02-27 10:25:33 +01:00
Kirill Tkhai e1603b6eff inotify: Extend ioctl to allow to request id of new watch descriptor
Watch descriptor is id of the watch created by inotify_add_watch().
It is allocated in inotify_add_to_idr(), and takes the numbers
starting from 1. Every new inotify watch obtains next available
number (usually, old + 1), as served by idr_alloc_cyclic().

CRIU (Checkpoint/Restore In Userspace) project supports inotify
files, and restores watched descriptors with the same numbers,
they had before dump. Since there was no kernel support, we
had to use cycle to add a watch with specific descriptor id:

	while (1) {
		int wd;

		wd = inotify_add_watch(inotify_fd, path, mask);
		if (wd < 0) {
			break;
		} else if (wd == desired_wd_id) {
			ret = 0;
			break;
		}

		inotify_rm_watch(inotify_fd, wd);
	}

(You may find the actual code at the below link:
 https://github.com/checkpoint-restore/criu/blob/v3.7/criu/fsnotify.c#L577)

The cycle is suboptiomal and very expensive, but since there is no better
kernel support, it was the only way to restore that. Happily, we had met
mostly descriptors with small id, and this approach had worked somehow.

But recent time containers with inotify with big watch descriptors
begun to come, and this way stopped to work at all. When descriptor id
is something about 0x34d71d6, the restoring process spins in busy loop
for a long time, and the restore hungs and delay of migration from node
to node could easily be watched.

This patch aims to solve this problem. It introduces new ioctl
INOTIFY_IOC_SETNEXTWD, which allows to request the number of next created
watch descriptor from userspace. It simply calls idr_set_cursor() primitive
to populate idr::idr_next, so that next idr_alloc_cyclic() allocation
will return this id, if it is not occupied. This is the way which is
used to restore some other resources from userspace. For example,
/proc/sys/kernel/ns_last_pid works the same for task pids.

The new code is under CONFIG_CHECKPOINT_RESTORE #define, so small system
may exclude it.

v2: Use INT_MAX instead of custom definition of max id,
as IDR subsystem guarantees id is between 0 and INT_MAX.

CC: Jan Kara <jack@suse.cz>
CC: Matthew Wilcox <willy@infradead.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-02-14 11:16:28 +01:00
Linus Torvalds a9a08845e9 vfs: do bulk POLL* -> EPOLL* replacement
This is the mindless scripted replacement of kernel use of POLL*
variables as described by Al, done by this script:

    for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
        L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
        for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
    done

with de-mangling cleanups yet to come.

NOTE! On almost all architectures, the EPOLL* constants have the same
values as the POLL* constants do.  But they keyword here is "almost".
For various bad reasons they aren't the same, and epoll() doesn't
actually work quite correctly in some cases due to this on Sparc et al.

The next patch from Al will sort out the final differences, and we
should be all done.

Scripted-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-02-11 14:34:03 -08:00
Linus Torvalds 168fe32a07 Merge branch 'misc.poll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull poll annotations from Al Viro:
 "This introduces a __bitwise type for POLL### bitmap, and propagates
  the annotations through the tree. Most of that stuff is as simple as
  'make ->poll() instances return __poll_t and do the same to local
  variables used to hold the future return value'.

  Some of the obvious brainos found in process are fixed (e.g. POLLIN
  misspelled as POLL_IN). At that point the amount of sparse warnings is
  low and most of them are for genuine bugs - e.g. ->poll() instance
  deciding to return -EINVAL instead of a bitmap. I hadn't touched those
  in this series - it's large enough as it is.

  Another problem it has caught was eventpoll() ABI mess; select.c and
  eventpoll.c assumed that corresponding POLL### and EPOLL### were
  equal. That's true for some, but not all of them - EPOLL### are
  arch-independent, but POLL### are not.

  The last commit in this series separates userland POLL### values from
  the (now arch-independent) kernel-side ones, converting between them
  in the few places where they are copied to/from userland. AFAICS, this
  is the least disruptive fix preserving poll(2) ABI and making epoll()
  work on all architectures.

  As it is, it's simply broken on sparc - try to give it EPOLLWRNORM and
  it will trigger only on what would've triggered EPOLLWRBAND on other
  architectures. EPOLLWRBAND and EPOLLRDHUP, OTOH, are never triggered
  at all on sparc. With this patch they should work consistently on all
  architectures"

* 'misc.poll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits)
  make kernel-side POLL... arch-independent
  eventpoll: no need to mask the result of epi_item_poll() again
  eventpoll: constify struct epoll_event pointers
  debugging printk in sg_poll() uses %x to print POLL... bitmap
  annotate poll(2) guts
  9p: untangle ->poll() mess
  ->si_band gets POLL... bitmap stored into a user-visible long field
  ring_buffer_poll_wait() return value used as return value of ->poll()
  the rest of drivers/*: annotate ->poll() instances
  media: annotate ->poll() instances
  fs: annotate ->poll() instances
  ipc, kernel, mm: annotate ->poll() instances
  net: annotate ->poll() instances
  apparmor: annotate ->poll() instances
  tomoyo: annotate ->poll() instances
  sound: annotate ->poll() instances
  acpi: annotate ->poll() instances
  crypto: annotate ->poll() instances
  block: annotate ->poll() instances
  x86: annotate ->poll() instances
  ...
2018-01-30 17:58:07 -08:00
Al Viro 076ccb76e1 fs: annotate ->poll() instances
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-11-27 16:20:05 -05:00
Linus Torvalds 1751e8a6cb Rename superblock flags (MS_xyz -> SB_xyz)
This is a pure automated search-and-replace of the internal kernel
superblock flags.

The s_flags are now called SB_*, with the names and the values for the
moment mirroring the MS_* flags that they're equivalent to.

Note how the MS_xyz flags are the ones passed to the mount system call,
while the SB_xyz flags are what we then use in sb->s_flags.

The script to do this was:

    # places to look in; re security/*: it generally should *not* be
    # touched (that stuff parses mount(2) arguments directly), but
    # there are two places where we really deal with superblock flags.
    FILES="drivers/mtd drivers/staging/lustre fs ipc mm \
            include/linux/fs.h include/uapi/linux/bfs_fs.h \
            security/apparmor/apparmorfs.c security/apparmor/include/lib.h"
    # the list of MS_... constants
    SYMS="RDONLY NOSUID NODEV NOEXEC SYNCHRONOUS REMOUNT MANDLOCK \
          DIRSYNC NOATIME NODIRATIME BIND MOVE REC VERBOSE SILENT \
          POSIXACL UNBINDABLE PRIVATE SLAVE SHARED RELATIME KERNMOUNT \
          I_VERSION STRICTATIME LAZYTIME SUBMOUNT NOREMOTELOCK NOSEC BORN \
          ACTIVE NOUSER"

    SED_PROG=
    for i in $SYMS; do SED_PROG="$SED_PROG -e s/MS_$i/SB_$i/g"; done

    # we want files that contain at least one of MS_...,
    # with fs/namespace.c and fs/pnode.c excluded.
    L=$(for i in $SYMS; do git grep -w -l MS_$i $FILES; done| sort|uniq|grep -v '^fs/namespace.c'|grep -v '^fs/pnode.c')

    for f in $L; do sed -i $f $SED_PROG; done

Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-27 13:05:09 -08:00
Linus Torvalds 9682b3dea2 Merge branch 'for-linus' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree updates from Jiri Kosina:
 "The usual rocket-science from trivial tree for 4.15"

* 'for-linus' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
  MAINTAINERS: relinquish kconfig
  MAINTAINERS: Update my email address
  treewide: Fix typos in Kconfig
  kfifo: Fix comments
  init/Kconfig: Fix module signing document location
  misc: ibmasm: Return error on error path
  HID: logitech-hidpp: fix mistake in printk, "feeback" -> "feedback"
  MAINTAINERS: Correct path to uDraw PS3 driver
  tracing: Fix doc mistakes in trace sample
  tracing: Kconfig text fixes for CONFIG_HWLAT_TRACER
  MIPS: Alchemy: Remove reverted CONFIG_NETLINK_MMAP from db1xxx_defconfig
  mm/huge_memory.c: fixup grammar in comment
  lib/xz: Add fall-through comments to a switch statement
2017-11-15 10:14:11 -08:00
Linus Torvalds f14fc0ccee Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull quota, ext2, isofs and udf fixes from Jan Kara:

 - two small quota error handling fixes

 - two isofs fixes for architectures with signed char

 - several udf block number overflow and signedness fixes

 - ext2 rework of mount option handling to avoid GFP_KERNEL allocation
   with spinlock held

 - ... it also contains a patch to implement auditing of responses to
   fanotify permission events. That should have been in the fanotify
   pull request but I mistakenly merged that patch into a wrong branch
   and noticed only now at which point I don't think it's worth rebasing
   and redoing.

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  quota: be aware of error from dquot_initialize
  quota: fix potential infinite loop
  isofs: use unsigned char types consistently
  isofs: fix timestamps beyond 2027
  udf: Fix some sign-conversion warnings
  udf: Fix signed/unsigned format specifiers
  udf: Fix 64-bit sign extension issues affecting blocks > 0x7FFFFFFF
  udf: Remove some outdate references from documentation
  udf: Avoid overflow when session starts at large offset
  ext2: Fix possible sleep in atomic during mount option parsing
  ext2: Parse mount options into a dedicated structure
  audit: Record fanotify access control decisions
2017-11-14 14:13:11 -08:00
Linus Torvalds 23281c8034 Merge branch 'fsnotify' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fsnotify updates from Jan Kara:

 - fixes of use-after-tree issues when handling fanotify permission
   events from Miklos

 - refcount_t conversions from Elena

 - fixes of ENOMEM handling in dnotify and fsnotify from me

* 'fsnotify' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  fsnotify: convert fsnotify_mark.refcnt from atomic_t to refcount_t
  fanotify: clean up CONFIG_FANOTIFY_ACCESS_PERMISSIONS ifdefs
  fsnotify: clean up fsnotify()
  fanotify: fix fsnotify_prepare_user_wait() failure
  fsnotify: fix pinning group in fsnotify_prepare_user_wait()
  fsnotify: pin both inode and vfsmount mark
  fsnotify: clean up fsnotify_prepare/finish_user_wait()
  fsnotify: convert fsnotify_group.refcnt from atomic_t to refcount_t
  fsnotify: Protect bail out path of fsnotify_add_mark_locked() properly
  dnotify: Handle errors from fsnotify_add_mark_locked() in fcntl_dirnotify()
2017-11-14 14:08:20 -08:00
Greg Kroah-Hartman b24413180f License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier.  The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
 - file had no licensing information it it.
 - file was a */uapi/* one with no licensing information in it,
 - file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne.  Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed.  Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
 - Files considered eligible had to be source code files.
 - Make and config files were included as candidates if they contained >5
   lines of source
 - File already had some variant of a license header in it (even if <5
   lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

 - when both scanners couldn't find any license traces, file was
   considered to have no license information in it, and the top level
   COPYING file license applied.

   For non */uapi/* files that summary was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0                                              11139

   and resulted in the first patch in this series.

   If that file was a */uapi/* path one, it was "GPL-2.0 WITH
   Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0 WITH Linux-syscall-note                        930

   and resulted in the second patch in this series.

 - if a file had some form of licensing information in it, and was one
   of the */uapi/* ones, it was denoted with the Linux-syscall-note if
   any GPL family license was found in the file or had no licensing in
   it (per prior point).  Results summary:

   SPDX license identifier                            # files
   ---------------------------------------------------|------
   GPL-2.0 WITH Linux-syscall-note                       270
   GPL-2.0+ WITH Linux-syscall-note                      169
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
   LGPL-2.1+ WITH Linux-syscall-note                      15
   GPL-1.0+ WITH Linux-syscall-note                       14
   ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
   LGPL-2.0+ WITH Linux-syscall-note                       4
   LGPL-2.1 WITH Linux-syscall-note                        3
   ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
   ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1

   and that resulted in the third patch in this series.

 - when the two scanners agreed on the detected license(s), that became
   the concluded license(s).

 - when there was disagreement between the two scanners (one detected a
   license but the other didn't, or they both detected different
   licenses) a manual inspection of the file occurred.

 - In most cases a manual inspection of the information in the file
   resulted in a clear resolution of the license that should apply (and
   which scanner probably needed to revisit its heuristics).

 - When it was not immediately clear, the license identifier was
   confirmed with lawyers working with the Linux Foundation.

 - If there was any question as to the appropriate license identifier,
   the file was flagged for further research and to be revisited later
   in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights.  The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
 - a full scancode scan run, collecting the matched texts, detected
   license ids and scores
 - reviewing anything where there was a license detected (about 500+
   files) to ensure that the applied SPDX license was correct
 - reviewing anything where there was no detection but the patch license
   was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
   SPDX license was correct

This produced a worksheet with 20 files needing minor correction.  This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg.  Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected.  This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.)  Finally Greg ran the script using the .csv files to
generate the patches.

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-02 11:10:55 +01:00
Elena Reshetova ab97f87325 fsnotify: convert fsnotify_mark.refcnt from atomic_t to refcount_t
atomic_t variables are currently used to implement reference
counters with the following properties:
 - counter is initialized to 1 using atomic_set()
 - a resource is freed upon counter reaching zero
 - once counter reaches zero, its further
   increments aren't allowed
 - counter schema uses basic atomic operations
   (set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.

The variable fsnotify_mark.refcnt is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

Suggested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Windsor <dwindsor@gmail.com>
Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-10-31 17:54:56 +01:00
Miklos Szeredi 6685df3125 fanotify: clean up CONFIG_FANOTIFY_ACCESS_PERMISSIONS ifdefs
The only negative from this patch should be an addition of 32bytes to
'struct fsnotify_group' if CONFIG_FANOTIFY_ACCESS_PERMISSIONS is not
defined.

Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-10-31 17:54:56 +01:00
Miklos Szeredi 3427ce7155 fsnotify: clean up fsnotify()
Use helpers to get first and next marks from connector.

Also get rid of inode_node/vfsmount_node local variables, which just refers
to the same objects as iter_info.  There was an srcu_dereference() for
foo_node, but that's completely superfluous since we've already done it
when obtaining foo_node.

Also get rid of inode_group/vfsmount_group local variables; checking
against non-NULL for these is the same as checking against non-NULL
inode_mark/vfsmount_mark.

Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-10-31 17:54:56 +01:00
Miklos Szeredi f37650f1c7 fanotify: fix fsnotify_prepare_user_wait() failure
If fsnotify_prepare_user_wait() fails, we leave the event on the
notification list.  Which will result in a warning in
fsnotify_destroy_event() and later use-after-free.

Instead of adding a new helper to remove the event from the list in this
case, I opted to move the prepare/finish up into fanotify_handle_event().

This will allow these to be moved further out into the generic code later,
and perhaps let us move to non-sleeping RCU.

Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 05f0e38724 ("fanotify: Release SRCU lock when waiting for userspace response")
Cc: <stable@vger.kernel.org> # v4.12
Signed-off-by: Jan Kara <jack@suse.cz>
2017-10-31 17:54:56 +01:00
Miklos Szeredi 9a31d7ad99 fsnotify: fix pinning group in fsnotify_prepare_user_wait()
Blind increment of group's user_waits is not enough, we could be far enough
in the group's destruction that it isn't taken into account (i.e. grabbing
the mark ref afterwards doesn't guarantee that it was the ref coming from
the _group_ that was grabbed).

Instead we need to check (under lock) that the mark is still attached to
the group after having obtained a ref to the mark.  If not, skip it.

Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 9385a84d7e ("fsnotify: Pass fsnotify_iter_info into handle_event handler")
Cc: <stable@vger.kernel.org> # v4.12
Signed-off-by: Jan Kara <jack@suse.cz>
2017-10-31 17:54:56 +01:00
Miklos Szeredi 0d6ec079d6 fsnotify: pin both inode and vfsmount mark
We may fail to pin one of the marks in fsnotify_prepare_user_wait() when
dropping the srcu read lock, resulting in use after free at the next
iteration.

Solution is to store both marks in iter_info instead of just the one we'll
be sending the event for.

Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 9385a84d7e ("fsnotify: Pass fsnotify_iter_info into handle_event handler")
Cc: <stable@vger.kernel.org> # v4.12
Signed-off-by: Jan Kara <jack@suse.cz>
2017-10-31 17:54:56 +01:00
Miklos Szeredi 24c20305c7 fsnotify: clean up fsnotify_prepare/finish_user_wait()
This patch doesn't actually fix any bug, just paves the way for fixing mark
and group pinning.

Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org> # v4.12
Signed-off-by: Jan Kara <jack@suse.cz>
2017-10-31 17:54:56 +01:00
Elena Reshetova 7761daa6a1 fsnotify: convert fsnotify_group.refcnt from atomic_t to refcount_t
atomic_t variables are currently used to implement reference
counters with the following properties:
 - counter is initialized to 1 using atomic_set()
 - a resource is freed upon counter reaching zero
 - once counter reaches zero, its further
   increments aren't allowed
 - counter schema uses basic atomic operations
   (set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.

The variable fsnotify_group.refcnt is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

Suggested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Windsor <dwindsor@gmail.com>
Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-10-31 17:54:56 +01:00
Jan Kara 9cf90cef36 fsnotify: Protect bail out path of fsnotify_add_mark_locked() properly
When fsnotify_add_mark_locked() fails it cleans up the mark it was
adding. Since the mark is already visible in group's list, we should
protect update of mark->flags with mark->lock. I'm not aware of any real
issues this could cause (since we also hold group->mark_mutex) but
better be safe and obey locking rules properly.

Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-10-31 17:54:56 +01:00
Jan Kara b3a0066005 dnotify: Handle errors from fsnotify_add_mark_locked() in fcntl_dirnotify()
fsnotify_add_mark_locked() can fail but we do not check its return
value. This didn't matter before commit 9dd813c15b "fsnotify: Move
mark list head from object into dedicated structure" as none of possible
failures could happen for dnotify but after that commit -ENOMEM can be
returned. Handle this error properly in fcntl_dirnotify() as
otherwise we just hit BUG_ON(dn_mark->dn) in dnotify_free_mark().

Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reported-by: syzkaller
Fixes: 9dd813c15b
Signed-off-by: Jan Kara <jack@suse.cz>
2017-10-31 17:41:04 +01:00
Masanari Iida 83fc61a563 treewide: Fix typos in Kconfig
This patch fixes some spelling typos found in Kconfig files.

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2017-10-12 15:42:00 +02:00
Steve Grubb de8cd83e91 audit: Record fanotify access control decisions
The fanotify interface allows user space daemons to make access
control decisions. Under common criteria requirements, we need to
optionally record decisions based on policy. This patch adds a bit mask,
FAN_AUDIT, that a user space daemon can 'or' into the response decision
which will tell the kernel that it made a decision and record it.

It would be used something like this in user space code:

  response.response = FAN_DENY | FAN_AUDIT;
  write(fd, &response, sizeof(struct fanotify_response));

When the syscall ends, the audit system will record the decision as a
AUDIT_FANOTIFY auxiliary record to denote that the reason this event
occurred is the result of an access control decision from fanotify
rather than DAC or MAC policy.

A sample event looks like this:

type=PATH msg=audit(1504310584.332:290): item=0 name="./evil-ls"
inode=1319561 dev=fc:03 mode=0100755 ouid=1000 ogid=1000 rdev=00:00
obj=unconfined_u:object_r:user_home_t:s0 nametype=NORMAL
type=CWD msg=audit(1504310584.332:290): cwd="/home/sgrubb"
type=SYSCALL msg=audit(1504310584.332:290): arch=c000003e syscall=2
success=no exit=-1 a0=32cb3fca90 a1=0 a2=43 a3=8 items=1 ppid=901
pid=959 auid=1000 uid=1000 gid=1000 euid=1000 suid=1000
fsuid=1000 egid=1000 sgid=1000 fsgid=1000 tty=pts1 ses=3 comm="bash"
exe="/usr/bin/bash" subj=unconfined_u:unconfined_r:unconfined_t:
s0-s0:c0.c1023 key=(null)
type=FANOTIFY msg=audit(1504310584.332:290): resp=2

Prior to using the audit flag, the developer needs to call
fanotify_init or'ing in FAN_ENABLE_AUDIT to ensure that the kernel
supports auditing. The calling process must also have the CAP_AUDIT_WRITE
capability.

Signed-off-by: sgrubb <sgrubb@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-10-10 13:18:06 +02:00
Bhumika Goyal c9ea9df303 fsnotify: make dnotify_fsnotify_ops const
Make this const as it is never modified.

Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-08-30 16:02:48 +02:00
Al Viro 49d31c2f38 dentry name snapshots
take_dentry_name_snapshot() takes a safe snapshot of dentry name;
if the name is a short one, it gets copied into caller-supplied
structure, otherwise an extra reference to external name is grabbed
(those are never modified).  In either case the pointer to stable
string is stored into the same structure.

dentry must be held by the caller of take_dentry_name_snapshot(),
but may be freely dropped afterwards - the snapshot will stay
until destroyed by release_dentry_name_snapshot().

Intended use:
	struct name_snapshot s;

	take_dentry_name_snapshot(&s, dentry);
	...
	access s.name
	...
	release_dentry_name_snapshot(&s);

Replaces fsnotify_oldname_...(), gets used in fsnotify to obtain the name
to pass down with event.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-07-07 20:09:10 -04:00
Amir Goldstein 4ff33aafd3 fanotify: don't expose EOPENSTALE to userspace
When delivering an event to userspace for a file on an NFS share,
if the file is deleted on server side before user reads the event,
user will not get the event.

If the event queue contained several events, the stale event is
quietly dropped and read() returns to user with events read so far
in the buffer.

If the event queue contains a single stale event or if the stale
event is a permission event, read() returns to user with the kernel
internal error code 518 (EOPENSTALE), which is not a POSIX error code.

Check the internal return value -EOPENSTALE in fanotify_read(), just
the same as it is checked in path_openat() and drop the event in the
cases that it is not already dropped.

This is a reproducer from Marko Rauhamaa:

Just take the example program listed under "man fanotify" ("fantest")
and follow these steps:

    ==============================================================
    NFS Server    NFS Client(1)     NFS Client(2)
    ==============================================================
    # echo foo >/nfsshare/bar.txt
                  # cat /nfsshare/bar.txt
                  foo
                                    # ./fantest /nfsshare
                                    Press enter key to terminate.
                                    Listening for events.
    # rm -f /nfsshare/bar.txt
                  # cat /nfsshare/bar.txt
                                    read: Unknown error 518
                  cat: /nfsshare/bar.txt: Operation not permitted
    ==============================================================

where NFS Client (1) and (2) are two terminal sessions on a single NFS
Client machine.

Reported-by: Marko Rauhamaa <marko.rauhamaa@f-secure.com>
Tested-by: Marko Rauhamaa <marko.rauhamaa@f-secure.com>
Cc: <linux-api@vger.kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-25 15:48:06 +02:00
Dan Carpenter f4edce1afd fsnotify: remove a stray unlock
We recently shifted this code around, so we're no longer holding the
lock on this path.

Fixes: 755b5bc681 ("fsnotify: Remove indirection from mark list addition")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-24 16:41:28 +02:00
Jan Kara 054c636e5c fsnotify: Move ->free_mark callback to fsnotify_ops
Pointer to ->free_mark callback unnecessarily occupies one long in each
fsnotify_mark although they are the same for all marks from one
notification group. Move the callback pointer to fsnotify_ops.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:36 +02:00
Jan Kara 7b12932340 fsnotify: Add group pointer in fsnotify_init_mark()
Currently we initialize mark->group only in fsnotify_add_mark_lock().
However we will need to access fsnotify_ops of corresponding group from
fsnotify_put_mark() so we need mark->group initialized earlier. Do that
in fsnotify_init_mark() which has a consequence that once
fsnotify_init_mark() is called on a mark, the mark has to be destroyed
by fsnotify_put_mark().

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:36 +02:00
Jan Kara ebb3b47e37 fsnotify: Drop inode_mark.c
inode_mark.c now contains only a single function. Move it to
fs/notify/fsnotify.c and remove inode_mark.c.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:36 +02:00
Jan Kara b1362edfe1 fsnotify: Remove fsnotify_find_{inode|vfsmount}_mark()
These are very thin wrappers, just remove them. Drop
fs/notify/vfsmount_mark.c as it is empty now.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:36 +02:00
Jan Kara 2e37c6ca8d fsnotify: Remove fsnotify_detach_group_marks()
The function is already mostly contained in what
fsnotify_clear_marks_by_group() does. Just update that function to not
select marks when all of them should be destroyed and remove
fsnotify_detach_group_marks().

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:36 +02:00
Jan Kara 18f2e0d3a4 fsnotify: Rename fsnotify_clear_marks_by_group_flags()
The _flags() suffix in the function name was more confusing than
explaining so just remove it. Also rename the argument from 'flags' to
'type' to better explain what the function expects.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Suggested-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:36 +02:00
Jan Kara 416bcdbcbb fsnotify: Inline fsnotify_clear_{inode|vfsmount}_mark_group()
Inline these helpers as they are very thin. We still keep them as we
don't want to expose details about how list type is determined.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:36 +02:00
Jan Kara 8920d2734d fsnotify: Remove fsnotify_recalc_{inode|vfsmount}_mask()
These helpers are just very thin wrappers now. Remove them.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:36 +02:00
Jan Kara 66d2b81bcb fsnotify: Remove fsnotify_set_mark_{,ignored_}mask_locked()
These helpers are now only a simple assignment and just obfuscate
what is going on. Remove them.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:36 +02:00
Jan Kara 05f0e38724 fanotify: Release SRCU lock when waiting for userspace response
When userspace task processing fanotify permission events screws up and
does not respond, fsnotify_mark_srcu SRCU is held indefinitely which
causes further hangs in the whole notification subsystem. Although we
cannot easily solve the problem of operations blocked waiting for
response from userspace, we can at least somewhat localize the damage by
dropping SRCU lock before waiting for userspace response and reacquiring
it when userspace responds.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:36 +02:00
Jan Kara 9385a84d7e fsnotify: Pass fsnotify_iter_info into handle_event handler
Pass fsnotify_iter_info into ->handle_event() handler so that it can
release and reacquire SRCU lock via fsnotify_prepare_user_wait() and
fsnotify_finish_user_wait() functions.  These functions also make sure
current marks are appropriately pinned so that iteration protected by
srcu in fsnotify() stays safe.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:36 +02:00
Jan Kara abc77577a6 fsnotify: Provide framework for dropping SRCU lock in ->handle_event
fanotify wants to drop fsnotify_mark_srcu lock when waiting for response
from userspace so that the whole notification subsystem is not blocked
during that time. This patch provides a framework for safely getting
mark reference for a mark found in the object list which pins the mark
in that list. We can then drop fsnotify_mark_srcu, wait for userspace
response and then safely continue iteration of the object list once we
reaquire fsnotify_mark_srcu.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:36 +02:00
Jan Kara f09b04a03e fsnotify: Remove special handling of mark destruction on group shutdown
Currently we queue all marks for destruction on group shutdown and then
destroy them from fsnotify_destroy_group() instead from a worker thread
which is the usual path. However worker can already be processing some
list of marks to destroy so this does not make 100% all marks are really
destroyed by the time group is shut down. This isn't a big problem as
each mark holds group reference and thus group stays partially alive
until all marks are really freed but there's no point in complicating
our lives - just wait for the delayed work to be finished instead.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:36 +02:00
Jan Kara 6b3f05d24d fsnotify: Detach mark from object list when last reference is dropped
Instead of removing mark from object list from fsnotify_detach_mark(),
remove the mark when last reference to the mark is dropped. This will
allow fanotify to wait for userspace response to event without having to
hold onto fsnotify_mark_srcu.

To avoid pinning inodes by elevated refcount (and thus e.g. delaying
file deletion) while someone holds mark reference, we detach connector
from the object also from fsnotify_destroy_marks() and not only after
removing last mark from the list as it was now.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:36 +02:00
Jan Kara 11375145a7 fsnotify: Move queueing of mark for destruction into fsnotify_put_mark()
Currently we queue mark into a list of marks for destruction in
__fsnotify_free_mark() and keep the last mark reference dangling. After the
worker waits for SRCU period, it drops the last reference to the mark
which frees it. This scheme has the disadvantage that if we hold
reference to a mark and drop and reacquire SRCU lock, the mark can get
freed immediately which is slightly inconvenient and we will need to
avoid this in the future.

Move to a scheme where queueing of mark into a list of marks for
destruction happens when the last reference to the mark is dropped. Also
drop reference to the mark held by group list already when mark is
removed from that list instead of dropping it only from the destruction
worker.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:35 +02:00
Jan Kara e725376058 inotify: Do not drop mark reference under idr_lock
Dropping mark reference can result in mark being freed. Although it
should not happen in inotify_remove_from_idr() since caller should hold
another reference, just don't risk lock up just after WARN_ON
unnecessarily. Also fold do_inotify_remove_from_idr() into the single
callsite as that function really is just two lines of real code.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:35 +02:00
Jan Kara 08991e83b7 fsnotify: Free fsnotify_mark_connector when there is no mark attached
Currently we free fsnotify_mark_connector structure only when inode /
vfsmount is getting freed. This can however impose noticeable memory
overhead when marks get attached to inodes only temporarily. So free the
connector structure once the last mark is detached from the object.
Since notification infrastructure can be working with the connector
under the protection of fsnotify_mark_srcu, we have to be careful and
free the fsnotify_mark_connector only after SRCU period passes.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:35 +02:00
Jan Kara 04662cab59 fsnotify: Lock object list with connector lock
So far list of marks attached to an object (inode / vfsmount) was
protected by i_lock or mnt_root->d_lock. This dictates that the list
must be empty before the object can be destroyed although the list is
now anchored in the fsnotify_mark_connector structure. Protect the list
by a spinlock in the fsnotify_mark_connector structure to decouple
lifetime of a list of marks from a lifetime of the object. This also
simplifies the code quite a bit since we don't have to differentiate
between inode and vfsmount lists in quite a few places anymore.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:35 +02:00
Jan Kara 2629718dd2 fsnotify: Remove useless list deletion and comment
After removing all the indirection it is clear that

hlist_del_init_rcu(&mark->obj_list);

in fsnotify_destroy_marks() is not needed as the mark gets removed from
the list shortly afterwards in fsnotify_destroy_mark() ->
fsnotify_detach_mark() -> fsnotify_detach_from_object(). Also there is
no problem with mark being visible on object list while we call
fsnotify_destroy_mark() as parallel destruction of marks from several
places is properly handled (as mentioned in the comment in
fsnotify_destroy_marks(). So just remove the list removal and also the
stale comment.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:35 +02:00
Jan Kara 73cd3c33ab fsnotify: Avoid double locking in fsnotify_detach_from_object()
We lock object list lock in fsnotify_detach_from_object() twice - once
to detach mark and second time to recalculate mask. That is unnecessary
and later it will become problematic as we will free the connector as
soon as there is no mark in it. So move recalculation of fsnotify mask
into the same critical section that is detaching mark.

This also removes recalculation of child dentry flags from
fsnotify_detach_from_object(). That is however fine. Those marks will
get recalculated once some event happens on a child.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:35 +02:00
Jan Kara 8212a6097a fsnotify: Remove indirection from fsnotify_detach_mark()
fsnotify_detach_mark() calls fsnotify_destroy_inode_mark() or
fsnotify_destroy_vfsmount_mark() to remove mark from object list. These
two functions are however very similar and differ only in the lock they
use to protect the object list of marks. Simplify the code by removing
the indirection and removing mark from the object list in a common
function.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:35 +02:00
Jan Kara a03e2e4f07 fsnotify: Determine lock in fsnotify_destroy_marks()
Instead of passing spinlock into fsnotify_destroy_marks() determine it
directly in that function from the connector type. This will reduce code
churn when changing lock protecting list of marks.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:35 +02:00
Jan Kara f06fd98759 fsnotify: Move locking into fsnotify_find_mark()
Move locking of a mark list into fsnotify_find_mark(). This reduces code
churn in the following patch changing lock protecting the list.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:35 +02:00
Jan Kara a242677bb1 fsnotify: Move locking into fsnotify_recalc_mask()
Move locking of locks protecting a list of marks into
fsnotify_recalc_mask(). This reduces code churn in the following patch
which changes the lock protecting the list of marks.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:35 +02:00
Jan Kara 0810b4f9f2 fsnotify: Move fsnotify_destroy_marks()
Move fsnotify_destroy_marks() to be later in the fs/notify/mark.c. It
will need some functions that are declared after its current
declaration. No functional change.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:35 +02:00
Jan Kara 755b5bc681 fsnotify: Remove indirection from mark list addition
Adding notification mark to object list has been currently done through
fsnotify_add_{inode|vfsmount}_mark() helpers from
fsnotify_add_mark_locked() which call fsnotify_add_mark_list(). Remove
this unnecessary indirection to simplify the code.

Pushing all the locking to fsnotify_add_mark_list() also allows us to
allocate the connector structure with GFP_KERNEL mode.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:35 +02:00
Jan Kara e911d8af87 fsnotify: Make fsnotify_mark_connector hold inode reference
Currently inode reference is held by fsnotify marks. Change the rules so
that inode reference is held by fsnotify_mark_connector structure
whenever the list is non-empty. This simplifies the code and is more
logical.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:35 +02:00
Jan Kara 86ffe245c4 fsnotify: Move object pointer to fsnotify_mark_connector
Move pointer to inode / vfsmount from mark itself to the
fsnotify_mark_connector structure. This is another step on the path
towards decoupling inode / vfsmount lifetime from notification mark
lifetime.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:35 +02:00
Jan Kara 9dd813c15b fsnotify: Move mark list head from object into dedicated structure
Currently notification marks are attached to object (inode or vfsmnt) by
a hlist_head in the object. The list is also protected by a spinlock in
the object. So while there is any mark attached to the list of marks,
the object must be pinned in memory (and thus e.g. last iput() deleting
inode cannot happen). Also for list iteration in fsnotify() to work, we
must hold fsnotify_mark_srcu lock so that mark itself and
mark->obj_list.next cannot get freed. Thus we are required to wait for
response to fanotify events from userspace process with
fsnotify_mark_srcu lock held. That causes issues when userspace process
is buggy and does not reply to some event - basically the whole
notification subsystem gets eventually stuck.

So to be able to drop fsnotify_mark_srcu lock while waiting for
response, we have to pin the mark in memory and make sure it stays in
the object list (as removing the mark waiting for response could lead to
lost notification events for groups later in the list). However we don't
want inode reclaim to block on such mark as that would lead to system
just locking up elsewhere.

This commit is the first in the series that paves way towards solving
these conflicting lifetime needs. Instead of anchoring the list of marks
directly in the object, we anchor it in a dedicated structure
(fsnotify_mark_connector) and just point to that structure from the
object. The following commits will also add spinlock protecting the list
and object pointer to the structure.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:34 +02:00
Jan Kara c1f33073ac fsnotify: Update comments
Add a comment that lifetime of a notification mark is protected by SRCU
and remove a comment about clearing of marks attached to the inode. It
is stale and more uptodate version is at fsnotify_destroy_marks() which
is the function handling this case.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:34 +02:00
Jan Kara c97476400d fanotify: Move recalculation of inode / vfsmount mask under mark_mutex
Move recalculation of inode / vfsmount notification mask under
group->mark_mutex of the mark which was modified. These are the only
places where mask recalculation happens without mark being protected
from detaching from inode / vfsmount which will cause issues with the
following patches.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-03 16:56:40 +02:00
Jan Kara 25c829afbd inotify: Remove inode pointers from debug messages
Printing inode pointers in warnings has dubious value and with future
changes we won't be able to easily get them without either locking or
chances we oops along the way. So just remove inode pointers from the
warning messages.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-03 16:56:34 +02:00
Jan Kara 5198adf649 fsnotify: Remove unnecessary tests when showing fdinfo
show_fdinfo() iterates group's list of marks. All marks found there are
guaranteed to be alive and they stay so until we release
group->mark_mutex. So remove uncecessary tests whether mark is alive.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-03 16:56:18 +02:00
Ingo Molnar 174cd4b1e5 sched/headers: Prepare to move signal wakeup & sigpending methods from <linux/sched.h> into <linux/sched/signal.h>
Fix up affected files that include this signal functionality via sched.h.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:32 +01:00
Ingo Molnar 5b825c3af1 sched/headers: Prepare to remove <linux/cred.h> inclusion from <linux/sched.h>
Add #include <linux/cred.h> dependencies to all .c files rely on sched.h
doing that for them.

Note that even if the count where we need to add extra headers seems high,
it's still a net win, because <linux/sched.h> is included in over
2,200 files ...

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:31 +01:00
Linus Torvalds f1ef09fde1 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull namespace updates from Eric Biederman:
 "There is a lot here. A lot of these changes result in subtle user
  visible differences in kernel behavior. I don't expect anything will
  care but I will revert/fix things immediately if any regressions show
  up.

  From Seth Forshee there is a continuation of the work to make the vfs
  ready for unpriviled mounts. We had thought the previous changes
  prevented the creation of files outside of s_user_ns of a filesystem,
  but it turns we missed the O_CREAT path. Ooops.

  Pavel Tikhomirov and Oleg Nesterov worked together to fix a long
  standing bug in the implemenation of PR_SET_CHILD_SUBREAPER where only
  children that are forked after the prctl are considered and not
  children forked before the prctl. The only known user of this prctl
  systemd forks all children after the prctl. So no userspace
  regressions will occur. Holding earlier forked children to the same
  rules as later forked children creates a semantic that is sane enough
  to allow checkpoing of processes that use this feature.

  There is a long delayed change by Nikolay Borisov to limit inotify
  instances inside a user namespace.

  Michael Kerrisk extends the API for files used to maniuplate
  namespaces with two new trivial ioctls to allow discovery of the
  hierachy and properties of namespaces.

  Konstantin Khlebnikov with the help of Al Viro adds code that when a
  network namespace exits purges it's sysctl entries from the dcache. As
  in some circumstances this could use a lot of memory.

  Vivek Goyal fixed a bug with stacked filesystems where the permissions
  on the wrong inode were being checked.

  I continue previous work on ptracing across exec. Allowing a file to
  be setuid across exec while being ptraced if the tracer has enough
  credentials in the user namespace, and if the process has CAP_SETUID
  in it's own namespace. Proc files for setuid or otherwise undumpable
  executables are now owned by the root in the user namespace of their
  mm. Allowing debugging of setuid applications in containers to work
  better.

  A bug I introduced with permission checking and automount is now
  fixed. The big change is to mark the mounts that the kernel initiates
  as a result of an automount. This allows the permission checks in sget
  to be safely suppressed for this kind of mount. As the permission
  check happened when the original filesystem was mounted.

  Finally a special case in the mount namespace is removed preventing
  unbounded chains in the mount hash table, and making the semantics
  simpler which benefits CRIU.

  The vfs fix along with related work in ima and evm I believe makes us
  ready to finish developing and merge fully unprivileged mounts of the
  fuse filesystem. The cleanups of the mount namespace makes discussing
  how to fix the worst case complexity of umount. The stacked filesystem
  fixes pave the way for adding multiple mappings for the filesystem
  uids so that efficient and safer containers can be implemented"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  proc/sysctl: Don't grab i_lock under sysctl_lock.
  vfs: Use upper filesystem inode in bprm_fill_uid()
  proc/sysctl: prune stale dentries during unregistering
  mnt: Tuck mounts under others instead of creating shadow/side mounts.
  prctl: propagate has_child_subreaper flag to every descendant
  introduce the walk_process_tree() helper
  nsfs: Add an ioctl() to return owner UID of a userns
  fs: Better permission checking for submounts
  exit: fix the setns() && PR_SET_CHILD_SUBREAPER interaction
  vfs: open() with O_CREAT should not create inodes with unknown ids
  nsfs: Add an ioctl() to return the namespace type
  proc: Better ownership of files for non-dumpable tasks in user namespaces
  exec: Remove LSM_UNSAFE_PTRACE_CAP
  exec: Test the ptracer's saved cred to see if the tracee can gain caps
  exec: Don't reset euid and egid when the tracee has CAP_SETUID
  inotify: Convert to using per-namespace limits
2017-02-23 20:33:51 -08:00
Linus Torvalds 70fcf5c339 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull UDF fixes and cleanups from Jan Kara:
 "Several small UDF fixes and cleanups and a small cleanup of fanotify
  code"

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  fanotify: simplify the code of fanotify_merge
  udf: simplify udf_ioctl()
  udf: fix ioctl errors
  udf: allow implicit blocksize specification during mount
  udf: check partition reference in udf_read_inode()
  udf: atomically read inode size
  udf: merge module informations in super.c
  udf: remove next_epos from udf_update_extent_cache()
  udf: Factor out trimming of crtime
  udf: remove empty condition
  udf: remove unneeded line break
  udf: merge bh free
  udf: use pointer for kernel_long_ad argument
  udf: use __packed instead of __attribute__ ((packed))
  udf: Make stat on symlink report symlink length as st_size
  fs/udf: make #ifdef UDF_PREALLOCATE unconditional
  fs: udf: Replace CURRENT_TIME with current_time()
2017-02-21 07:44:03 -08:00
Kinglong Mee 6c71100db5 fanotify: simplify the code of fanotify_merge
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-02-09 14:09:22 +01:00
Nikolay Borisov 1cce1eea0a inotify: Convert to using per-namespace limits
This patchset converts inotify to using the newly introduced
per-userns sysctl infrastructure.

Currently the inotify instances/watches are being accounted in the
user_struct structure. This means that in setups where multiple
users in unprivileged containers map to the same underlying
real user (i.e. pointing to the same user_struct) the inotify limits
are going to be shared as well, allowing one user(or application) to exhaust
all others limits.

Fix this by switching the inotify sysctls to using the
per-namespace/per-user limits. This will allow the server admin to
set sensible global limits, which can further be tuned inside every
individual user namespace. Additionally, in order to preserve the
sysctl ABI make the existing inotify instances/watches sysctls
modify the values of the initial user namespace.

Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2017-01-24 12:03:07 +13:00
Linus Torvalds 6989606a72 Merge branch 'stable-4.10' of git://git.infradead.org/users/pcmoore/audit
Pull audit fixes from Paul Moore:
 "Two small fixes relating to audit's use of fsnotify.

  The first patch plugs a leak and the second fixes some lock
  shenanigans. The patches are small and I banged on this for an
  afternoon with our testsuite and didn't see anything odd"

* 'stable-4.10' of git://git.infradead.org/users/pcmoore/audit:
  audit: Fix sleep in atomic
  fsnotify: Remove fsnotify_duplicate_mark()
2017-01-05 23:06:06 -08:00
Jan Kara e3ba730702 fsnotify: Remove fsnotify_duplicate_mark()
There are only two calls sites of fsnotify_duplicate_mark(). Those are
in kernel/audit_tree.c and both are bogus. Vfsmount pointer is unused
for audit tree, inode pointer and group gets set in
fsnotify_add_mark_locked() later anyway, mask and free_mark are already
set in alloc_chunk(). In fact, calling fsnotify_duplicate_mark() is
actively harmful because following fsnotify_add_mark_locked() will leak
group reference by overwriting the group pointer. So just remove the two
calls to fsnotify_duplicate_mark() and the function.

Signed-off-by: Jan Kara <jack@suse.cz>
[PM: line wrapping to fit in 80 chars]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2016-12-23 16:40:32 -05:00
Linus Torvalds e93b1cc8a8 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull quota, fsnotify and ext2 updates from Jan Kara:
 "Changes to locking of some quota operations from dedicated quota mutex
  to s_umount semaphore, a fsnotify fix and a simple ext2 fix"

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  quota: Fix bogus warning in dquot_disable()
  fsnotify: Fix possible use-after-free in inode iteration on umount
  ext2: reject inodes with negative size
  quota: Remove dqonoff_mutex
  ocfs2: Use s_umount for quota recovery protection
  quota: Remove dqonoff_mutex from dquot_scan_active()
  ocfs2: Protect periodic quota syncing with s_umount semaphore
  quota: Use s_umount protection for quota operations
  quota: Hold s_umount in exclusive mode when enabling / disabling quotas
  fs: Provide function to get superblock with exclusive s_umount
2016-12-19 08:23:53 -08:00
Jan Kara 5716863e0f fsnotify: Fix possible use-after-free in inode iteration on umount
fsnotify_unmount_inodes() plays complex tricks to pin next inode in the
sb->s_inodes list when iterating over all inodes. Furthermore the code has a
bug that if the current inode is the last on i_sb_list that does not have e.g.
I_FREEING set, then we leave next_i pointing to inode which may get removed
from the i_sb_list once we drop s_inode_list_lock thus resulting in
use-after-free issues (usually manifesting as infinite looping in
fsnotify_unmount_inodes()).

Fix the problem by keeping current inode pinned somewhat longer. Then we can
make the code much simpler and standard.

CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
2016-12-13 12:57:52 +01:00
Al Viro 12c7f9dc0f constify fsnotify_parent()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-05 18:58:32 -05:00
Al Viro e637835ecc fsnotify(): constify 'data'
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-05 18:58:31 -05:00
Al Viro 3cd5eca8d7 fsnotify: constify 'data' passed to ->handle_event()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-05 18:58:31 -05:00
Jan Kara ed2726406c fsnotify: clean up spinlock assertions
Use assert_spin_locked() macro instead of hand-made BUG_ON statements.

Link: http://lkml.kernel.org/r/1474537439-18919-1-git-send-email-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Suggested-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:26 -07:00
Jan Kara 0b1b86527d fanotify: fix possible false warning when freeing events
When freeing permission events by fsnotify_destroy_event(), the warning
WARN_ON(!list_empty(&event->list)); may falsely hit.

This is because although fanotify_get_response() saw event->response
set, there is nothing to make sure the current CPU also sees the removal
of the event from the list.  Add proper locking around the WARN_ON() to
avoid the false warning.

Link: http://lkml.kernel.org/r/1473797711-14111-7-git-send-email-jack@suse.cz
Reported-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Cc: Eric Paris <eparis@redhat.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:26 -07:00
Jan Kara 073f65522a fanotify: use notification_lock instead of access_lock
Fanotify code has its own lock (access_lock) to protect a list of events
waiting for a response from userspace.

However this is somewhat awkward as the same list_head in the event is
protected by notification_lock if it is part of the notification queue
and by access_lock if it is part of the fanotify private queue which
makes it difficult for any reliable checks in the generic code.  So make
fanotify use the same lock - notification_lock - for protecting its
private event list.

Link: http://lkml.kernel.org/r/1473797711-14111-6-git-send-email-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Eric Paris <eparis@redhat.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:26 -07:00
Jan Kara c21dbe20f6 fsnotify: convert notification_mutex to a spinlock
notification_mutex is used to protect the list of pending events.  As such
there's no reason to use a sleeping lock for it.  Convert it to a
spinlock.

[jack@suse.cz: fixed version]
  Link: http://lkml.kernel.org/r/1474031567-1831-1-git-send-email-jack@suse.cz
Link: http://lkml.kernel.org/r/1473797711-14111-5-git-send-email-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Eric Paris <eparis@redhat.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:26 -07:00
Jan Kara 1404ff3cc3 fsnotify: drop notification_mutex before destroying event
fsnotify_flush_notify() and fanotify_release() destroy notification
event while holding notification_mutex.

The destruction of fanotify event includes a path_put() call which may
end up calling into a filesystem to delete an inode if we happen to be
the last holders of dentry reference which happens to be the last holder
of inode reference.

That in turn may violate lock ordering for some filesystems since
notification_mutex is also acquired e. g. during write when generating
fanotify event.

Also this is the only thing that forces notification_mutex to be a
sleeping lock.  So drop notification_mutex before destroying a
notification event.

Link: http://lkml.kernel.org/r/1473797711-14111-4-git-send-email-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Cc: Eric Paris <eparis@redhat.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:26 -07:00
Jan Kara 96d41019e3 fanotify: fix list corruption in fanotify_get_response()
fanotify_get_response() calls fsnotify_remove_event() when it finds that
group is being released from fanotify_release() (bypass_perm is set).

However the event it removes need not be only in the group's notification
queue but it can have already moved to access_list (userspace read the
event before closing the fanotify instance fd) which is protected by a
different lock.  Thus when fsnotify_remove_event() races with
fanotify_release() operating on access_list, the list can get corrupted.

Fix the problem by moving all the logic removing permission events from
the lists to one place - fanotify_release().

Fixes: 5838d4442b ("fanotify: fix double free of pending permission events")
Link: http://lkml.kernel.org/r/1473797711-14111-3-git-send-email-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reported-by: Miklos Szeredi <mszeredi@redhat.com>
Tested-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-09-19 15:36:17 -07:00
Jan Kara 12703dbfeb fsnotify: add a way to stop queueing events on group shutdown
Implement a function that can be called when a group is being shutdown
to stop queueing new events to the group.  Fanotify will use this.

Fixes: 5838d4442b ("fanotify: fix double free of pending permission events")
Link: http://lkml.kernel.org/r/1473797711-14111-2-git-send-email-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-09-19 15:36:17 -07:00
Jan Kara 35e481761c fsnotify: avoid spurious EMFILE errors from inotify_init()
Inotify instance is destroyed when all references to it are dropped.
That not only means that the corresponding file descriptor needs to be
closed but also that all corresponding instance marks are freed (as each
mark holds a reference to the inotify instance).  However marks are
freed only after SRCU period ends which can take some time and thus if
user rapidly creates and frees inotify instances, number of existing
inotify instances can exceed max_user_instances limit although from user
point of view there is always at most one existing instance.  Thus
inotify_init() returns EMFILE error which is hard to justify from user
point of view.  This problem is exposed by LTP inotify06 testcase on
some machines.

We fix the problem by making sure all group marks are properly freed
while destroying inotify instance.  We wait for SRCU period to end in
that path anyway since we have to make sure there is no event being
added to the instance while we are tearing down the instance.  So it
takes only some plumbing to allow for marks to be destroyed in that path
as well and not from a dedicated work item.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Jan Kara <jack@suse.cz>
Reported-by: Xiaoguang Wang <wangxg.fnst@cn.fujitsu.com>
Tested-by: Xiaoguang Wang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-19 19:12:14 -07:00
Jeff Layton 0918f1c309 fsnotify: turn fsnotify reaper thread into a workqueue job
We don't require a dedicated thread for fsnotify cleanup.  Switch it
over to a workqueue job instead that runs on the system_unbound_wq.

In the interest of not thrashing the queued job too often when there are
a lot of marks being removed, we delay the reaper job slightly when
queueing it, to allow several to gather on the list.

Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Tested-by: Eryu Guan <guaneryu@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Eric Paris <eparis@parisplace.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-18 16:23:24 -08:00
Jeff Layton 13d34ac6e5 Revert "fsnotify: destroy marks with call_srcu instead of dedicated thread"
This reverts commit c510eff6be ("fsnotify: destroy marks with
call_srcu instead of dedicated thread").

Eryu reported that he was seeing some OOM kills kick in when running a
testcase that adds and removes inotify marks on a file in a tight loop.

The above commit changed the code to use call_srcu to clean up the
marks.  While that does (in principle) work, the srcu callback job is
limited to cleaning up entries in small batches and only once per jiffy.
It's easily possible to overwhelm that machinery with too many call_srcu
callbacks, and Eryu's reproduer did just that.

There's also another potential problem with using call_srcu here.  While
you can obviously sleep while holding the srcu_read_lock, the callbacks
run under local_bh_disable, so you can't sleep there.

It's possible when putting the last reference to the fsnotify_mark that
we'll end up putting a chain of references including the fsnotify_group,
uid, and associated keys.  While I don't see any obvious ways that that
could occurs, it's probably still best to avoid using call_srcu here
after all.

This patch reverts the above patch.  A later patch will take a different
approach to eliminated the dedicated thread here.

Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Reported-by: Eryu Guan <guaneryu@gmail.com>
Tested-by: Eryu Guan <guaneryu@gmail.com>
Cc: Jan Kara <jack@suse.com>
Cc: Eric Paris <eparis@parisplace.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-18 16:23:24 -08:00
Jeff Layton c510eff6be fsnotify: destroy marks with call_srcu instead of dedicated thread
At the time that this code was originally written, call_srcu didn't
exist, so this thread was required to ensure that we waited for that
SRCU grace period to settle before finally freeing the object.

It does exist now however and we can much more efficiently use call_srcu
to handle this.  That also allows us to potentially use srcu_barrier to
ensure that they are all of the callbacks have run before proceeding.
In order to conserve space, we union the rcu_head with the g_list.

This will be necessary for nfsd which will allocate marks from a
dedicated slabcache.  We have to be able to ensure that all of the
objects are destroyed before destroying the cache.  That's fairly

Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Cc: Eric Paris <eparis@parisplace.org>
Reviewed-by: Jan Kara <jack@suse.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Geliang Tang 1deaf9d197 fs/notify/inode_mark.c: use list_next_entry in fsnotify_unmount_inodes
To make the intention clearer, use list_next_entry instead of
list_entry.

Signed-off-by: Geliang Tang <geliangtang@163.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Dave Hansen d30e2c05a1 inotify: actually check for invalid bits in sys_inotify_add_watch()
The comment here says that it is checking for invalid bits.  But, the mask
is *actually* checking to ensure that _any_ valid bit is set, which is
quite different.

Without this check, an unexpected bit could get set on an inotify object.
Since these bits are also interpreted by the fsnotify/dnotify code, there
is the potential for an object to be mishandled inside the kernel.  For
instance, can we be sure that setting the dnotify flag FS_DN_RENAME on an
inotify watch is harmless?

Add the actual check which was intended.  Retain the existing inotify bits
are being added to the watch.  Plus, this is existing behavior which would
be nice to preserve.

I did a quick sniff test that inotify functions and that my
'inotify-tools' package passes 'make check'.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: John McCutchan <john@johnmccutchan.com>
Cc: Robert Love <rlove@rlove.org>
Cc: Eric Paris <eparis@parisplace.org>
Cc: Josh Boyer <jwboyer@fedoraproject.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05 19:34:48 -08:00
Dave Hansen 6933599697 inotify: hide internal kernel bits from fdinfo
There was a report that my patch:

    inotify: actually check for invalid bits in sys_inotify_add_watch()

broke CRIU.

The reason is that CRIU looks up raw flags in /proc/$pid/fdinfo/* to
figure out how to rebuild inotify watches and then passes those flags
directly back in to the inotify API.  One of those flags
(FS_EVENT_ON_CHILD) is set in mark->mask, but is not part of the inotify
API.  It is used inside the kernel to _implement_ inotify but it is not
and has never been part of the API.

My patch above ensured that we only allow bits which are part of the API
(IN_ALL_EVENTS).  This broke CRIU.

FS_EVENT_ON_CHILD is really internal to the kernel.  It is set _anyway_ on
all inotify marks.  So, CRIU was really just trying to set a bit that was
already set.

This patch hides that bit from fdinfo.  CRIU will not see the bit, not try
to set it, and should work as before.  We should not have been exposing
this bit in the first place, so this is a good patch independent of the
CRIU problem.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reported-by: Andrey Wagin <avagin@gmail.com>
Acked-by: Andrey Vagin <avagin@openvz.org>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Eric Paris <eparis@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: John McCutchan <john@johnmccutchan.com>
Cc: Robert Love <rlove@rlove.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05 19:34:48 -08:00
Linus Torvalds 7d9071a095 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
 "In this one:

   - d_move fixes (Eric Biederman)

   - UFS fixes (me; locking is mostly sane now, a bunch of bugs in error
     handling ought to be fixed)

   - switch of sb_writers to percpu rwsem (Oleg Nesterov)

   - superblock scalability (Josef Bacik and Dave Chinner)

   - swapon(2) race fix (Hugh Dickins)"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (65 commits)
  vfs: Test for and handle paths that are unreachable from their mnt_root
  dcache: Reduce the scope of i_lock in d_splice_alias
  dcache: Handle escaped paths in prepend_path
  mm: fix potential data race in SyS_swapon
  inode: don't softlockup when evicting inodes
  inode: rename i_wb_list to i_io_list
  sync: serialise per-superblock sync operations
  inode: convert inode_sb_list_lock to per-sb
  inode: add hlist_fake to avoid the inode hash lock in evict
  writeback: plug writeback at a high level
  change sb_writers to use percpu_rw_semaphore
  shift percpu_counter_destroy() into destroy_super_work()
  percpu-rwsem: kill CONFIG_PERCPU_RWSEM
  percpu-rwsem: introduce percpu_rwsem_release() and percpu_rwsem_acquire()
  percpu-rwsem: introduce percpu_down_read_trylock()
  document rwsem_release() in sb_wait_write()
  fix the broken lockdep logic in __sb_start_write()
  introduce __sb_writers_{acquired,release}() helpers
  ufs_inode_get{frag,block}(): get rid of 'phys' argument
  ufs_getfrag_block(): tidy up a bit
  ...
2015-09-05 20:34:28 -07:00
Jan Kara 4712e722f9 fsnotify: get rid of fsnotify_destroy_mark_locked()
fsnotify_destroy_mark_locked() is subtle to use because it temporarily
releases group->mark_mutex.  To avoid future problems with this
function, split it into two.

fsnotify_detach_mark() is the part that needs group->mark_mutex and
fsnotify_free_mark() is the part that must be called outside of
group->mark_mutex.  This way it's much clearer what's going on and we
also avoid some pointless acquisitions of group->mark_mutex.

Signed-off-by: Jan Kara <jack@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Jan Kara 925d1132a0 fsnotify: remove mark->free_list
Free list is used when all marks on given inode / mount should be
destroyed when inode / mount is going away.  However we can free all of
the marks without using a special list with some care.

Signed-off-by: Jan Kara <jack@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Jan Kara 3c53e51421 fsnotify: fix check in inotify fdinfo printing
A check in inotify_fdinfo() checking whether mark is valid was always
true due to a bug.  Luckily we can never get to invalidated marks since
we hold mark_mutex and invalidated marks get removed from the group list
when they are invalidated under that mutex.

Anyway fix the check to make code more future proof.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Dave Hansen 7c49b86164 fs/notify: optimize inotify/fsnotify code for unwatched files
I have a _tiny_ microbenchmark that sits in a loop and writes single
bytes to a file.  Writing one byte to a tmpfs file is around 2x slower
than reading one byte from a file, which is a _bit_ more than I expecte.
This is a dumb benchmark, but I think it's hard to deny that write() is
a hot path and we should avoid unnecessary overhead there.

I did a 'perf record' of 30-second samples of read and write.  The top
item in a diffprofile is srcu_read_lock() from fsnotify().  There are
active inotify fd's from systemd, but nothing is actually listening to
the file or its part of the filesystem.

I *think* we can avoid taking the srcu_read_lock() for the common case
where there are no actual marks on the file.  This means that there will
both be nothing to notify for *and* implies that there is no need for
clearing the ignore mask.

This patch gave a 13.1% speedup in writes/second on my test, which is an
improvement from the 10.8% that I saw with the last version.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Cc: John McCutchan <john@johnmccutchan.com>
Cc: Robert Love <rlove@rlove.org>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Dave Chinner 74278da9f7 inode: convert inode_sb_list_lock to per-sb
The process of reducing contention on per-superblock inode lists
starts with moving the locking to match the per-superblock inode
list. This takes the global lock out of the picture and reduces the
contention problems to within a single filesystem. This doesn't get
rid of contention as the locks still have global CPU scope, but it
does isolate operations on different superblocks form each other.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Dave Chinner <dchinner@redhat.com>
2015-08-17 18:39:46 -04:00
Jan Kara 8f2f3eb59d fsnotify: fix oops in fsnotify_clear_marks_by_group_flags()
fsnotify_clear_marks_by_group_flags() can race with
fsnotify_destroy_marks() so that when fsnotify_destroy_mark_locked()
drops mark_mutex, a mark from the list iterated by
fsnotify_clear_marks_by_group_flags() can be freed and thus the next
entry pointer we have cached may become stale and we dereference free
memory.

Fix the problem by first moving marks to free to a special private list
and then always free the first entry in the special list.  This method
is safe even when entries from the list can disappear once we drop the
lock.

Signed-off-by: Jan Kara <jack@suse.com>
Reported-by: Ashish Sangwan <a.sangwan@samsung.com>
Reviewed-by: Ashish Sangwan <a.sangwan@samsung.com>
Cc: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-07 04:39:41 +03:00
Linus Torvalds d725e66c06 Revert "fsnotify: fix oops in fsnotify_clear_marks_by_group_flags()"
This reverts commit a2673b6e04.

Kinglong Mee reports a memory leak with that patch, and Jan Kara confirms:

 "Thanks for report! You are right that my patch introduces a race
  between fsnotify kthread and fsnotify_destroy_group() which can result
  in leaking inotify event on group destruction.

  I haven't yet decided whether the right fix is not to queue events for
  dying notification group (as that is pointless anyway) or whether we
  should just fix the original problem differently...  Whenever I look
  at fsnotify code mark handling I get lost in the maze of locks, lists,
  and subtle differences between how different notification systems
  handle notification marks :( I'll think about it over night"

and after thinking about it, Jan says:

 "OK, I have looked into the code some more and I found another
  relatively simple way of fixing the original oops.  It will be IMHO
  better than trying to fixup this issue which has more potential for
  breakage.  I'll ask Linus to revert the fsnotify fix he already merged
  and send a new fix"

Reported-by: Kinglong Mee <kinglongmee@gmail.com>
Requested-by: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-07-21 16:06:53 -07:00
Jan Kara a2673b6e04 fsnotify: fix oops in fsnotify_clear_marks_by_group_flags()
fsnotify_clear_marks_by_group_flags() can race with
fsnotify_destroy_marks() so when fsnotify_destroy_mark_locked() drops
mark_mutex, a mark from the list iterated by
fsnotify_clear_marks_by_group_flags() can be freed and we dereference free
memory in the loop there.

Fix the problem by keeping mark_mutex held in
fsnotify_destroy_mark_locked().  The reason why we drop that mutex is that
we need to call a ->freeing_mark() callback which may acquire mark_mutex
again.  To avoid this and similar lock inversion issues, we move the call
to ->freeing_mark() callback to the kthread destroying the mark.

Signed-off-by: Jan Kara <jack@suse.cz>
Reported-by: Ashish Sangwan <a.sangwan@samsung.com>
Suggested-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-07-17 16:39:54 -07:00