linux-stable/fs/gfs2/ops_fstype.c

1824 lines
46 KiB
C
Raw Normal View History

// SPDX-License-Identifier: GPL-2.0-only
/*
* Copyright (C) Sistina Software, Inc. 1997-2003 All rights reserved.
* Copyright (C) 2004-2008 Red Hat, Inc. All rights reserved.
*/
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/completion.h>
#include <linux/buffer_head.h>
#include <linux/blkdev.h>
#include <linux/kthread.h>
#include <linux/export.h>
#include <linux/namei.h>
#include <linux/mount.h>
#include <linux/gfs2_ondisk.h>
#include <linux/quotaops.h>
#include <linux/lockdep.h>
fs: Limit sys_mount to only request filesystem modules. Modify the request_module to prefix the file system type with "fs-" and add aliases to all of the filesystems that can be built as modules to match. A common practice is to build all of the kernel code and leave code that is not commonly needed as modules, with the result that many users are exposed to any bug anywhere in the kernel. Looking for filesystems with a fs- prefix limits the pool of possible modules that can be loaded by mount to just filesystems trivially making things safer with no real cost. Using aliases means user space can control the policy of which filesystem modules are auto-loaded by editing /etc/modprobe.d/*.conf with blacklist and alias directives. Allowing simple, safe, well understood work-arounds to known problematic software. This also addresses a rare but unfortunate problem where the filesystem name is not the same as it's module name and module auto-loading would not work. While writing this patch I saw a handful of such cases. The most significant being autofs that lives in the module autofs4. This is relevant to user namespaces because we can reach the request module in get_fs_type() without having any special permissions, and people get uncomfortable when a user specified string (in this case the filesystem type) goes all of the way to request_module. After having looked at this issue I don't think there is any particular reason to perform any filtering or permission checks beyond making it clear in the module request that we want a filesystem module. The common pattern in the kernel is to call request_module() without regards to the users permissions. In general all a filesystem module does once loaded is call register_filesystem() and go to sleep. Which means there is not much attack surface exposed by loading a filesytem module unless the filesystem is mounted. In a user namespace filesystems are not mounted unless .fs_flags = FS_USERNS_MOUNT, which most filesystems do not set today. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Acked-by: Kees Cook <keescook@chromium.org> Reported-by: Kees Cook <keescook@google.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2013-03-03 03:39:14 +00:00
#include <linux/module.h>
#include <linux/backing-dev.h>
#include <linux/fs_parser.h>
#include "gfs2.h"
#include "incore.h"
#include "bmap.h"
#include "glock.h"
#include "glops.h"
#include "inode.h"
#include "recovery.h"
#include "rgrp.h"
#include "super.h"
#include "sys.h"
#include "util.h"
#include "log.h"
#include "quota.h"
#include "dir.h"
#include "meta_io.h"
#include "trace_gfs2.h"
#include "lops.h"
#define DO 0
#define UNDO 1
/**
* gfs2_tune_init - Fill a gfs2_tune structure with default values
* @gt: tune
*
*/
static void gfs2_tune_init(struct gfs2_tune *gt)
{
spin_lock_init(&gt->gt_spin);
gt->gt_quota_warn_period = 10;
gt->gt_quota_scale_num = 1;
gt->gt_quota_scale_den = 1;
gt->gt_new_files_jdata = 0;
gt->gt_max_readahead = BIT(18);
gt->gt_complain_secs = 10;
}
void free_sbd(struct gfs2_sbd *sdp)
{
if (sdp->sd_lkstats)
free_percpu(sdp->sd_lkstats);
kfree(sdp);
}
static struct gfs2_sbd *init_sbd(struct super_block *sb)
{
struct gfs2_sbd *sdp;
struct address_space *mapping;
sdp = kzalloc(sizeof(struct gfs2_sbd), GFP_KERNEL);
if (!sdp)
return NULL;
sdp->sd_vfs = sb;
GFS2: glock statistics gathering The stats are divided into two sets: those relating to the super block and those relating to an individual glock. The super block stats are done on a per cpu basis in order to try and reduce the overhead of gathering them. They are also further divided by glock type. In the case of both the super block and glock statistics, the same information is gathered in each case. The super block statistics are used to provide default values for most of the glock statistics, so that newly created glocks should have, as far as possible, a sensible starting point. The statistics are divided into three pairs of mean and variance, plus two counters. The mean/variance pairs are smoothed exponential estimates and the algorithm used is one which will be very familiar to those used to calculation of round trip times in network code. The three pairs of mean/variance measure the following things: 1. DLM lock time (non-blocking requests) 2. DLM lock time (blocking requests) 3. Inter-request time (again to the DLM) A non-blocking request is one which will complete right away, whatever the state of the DLM lock in question. That currently means any requests when (a) the current state of the lock is exclusive (b) the requested state is either null or unlocked or (c) the "try lock" flag is set. A blocking request covers all the other lock requests. There are two counters. The first is there primarily to show how many lock requests have been made, and thus how much data has gone into the mean/variance calculations. The other counter is counting queueing of holders at the top layer of the glock code. Hopefully that number will be a lot larger than the number of dlm lock requests issued. So why gather these statistics? There are several reasons we'd like to get a better idea of these timings: 1. To be able to better set the glock "min hold time" 2. To spot performance issues more easily 3. To improve the algorithm for selecting resource groups for allocation (to base it on lock wait time, rather than blindly using a "try lock") Due to the smoothing action of the updates, a step change in some input quantity being sampled will only fully be taken into account after 8 samples (or 4 for the variance) and this needs to be carefully considered when interpreting the results. Knowing both the time it takes a lock request to complete and the average time between lock requests for a glock means we can compute the total percentage of the time for which the node is able to use a glock vs. time that the rest of the cluster has its share. That will be very useful when setting the lock min hold time. The other point to remember is that all times are in nanoseconds. Great care has been taken to ensure that we measure exactly the quantities that we want, as accurately as possible. There are always inaccuracies in any measuring system, but I hope this is as accurate as we can reasonably make it. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2012-01-20 10:38:36 +00:00
sdp->sd_lkstats = alloc_percpu(struct gfs2_pcpu_lkstats);
if (!sdp->sd_lkstats)
goto fail;
sb->s_fs_info = sdp;
GFS2: glock statistics gathering The stats are divided into two sets: those relating to the super block and those relating to an individual glock. The super block stats are done on a per cpu basis in order to try and reduce the overhead of gathering them. They are also further divided by glock type. In the case of both the super block and glock statistics, the same information is gathered in each case. The super block statistics are used to provide default values for most of the glock statistics, so that newly created glocks should have, as far as possible, a sensible starting point. The statistics are divided into three pairs of mean and variance, plus two counters. The mean/variance pairs are smoothed exponential estimates and the algorithm used is one which will be very familiar to those used to calculation of round trip times in network code. The three pairs of mean/variance measure the following things: 1. DLM lock time (non-blocking requests) 2. DLM lock time (blocking requests) 3. Inter-request time (again to the DLM) A non-blocking request is one which will complete right away, whatever the state of the DLM lock in question. That currently means any requests when (a) the current state of the lock is exclusive (b) the requested state is either null or unlocked or (c) the "try lock" flag is set. A blocking request covers all the other lock requests. There are two counters. The first is there primarily to show how many lock requests have been made, and thus how much data has gone into the mean/variance calculations. The other counter is counting queueing of holders at the top layer of the glock code. Hopefully that number will be a lot larger than the number of dlm lock requests issued. So why gather these statistics? There are several reasons we'd like to get a better idea of these timings: 1. To be able to better set the glock "min hold time" 2. To spot performance issues more easily 3. To improve the algorithm for selecting resource groups for allocation (to base it on lock wait time, rather than blindly using a "try lock") Due to the smoothing action of the updates, a step change in some input quantity being sampled will only fully be taken into account after 8 samples (or 4 for the variance) and this needs to be carefully considered when interpreting the results. Knowing both the time it takes a lock request to complete and the average time between lock requests for a glock means we can compute the total percentage of the time for which the node is able to use a glock vs. time that the rest of the cluster has its share. That will be very useful when setting the lock min hold time. The other point to remember is that all times are in nanoseconds. Great care has been taken to ensure that we measure exactly the quantities that we want, as accurately as possible. There are always inaccuracies in any measuring system, but I hope this is as accurate as we can reasonably make it. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2012-01-20 10:38:36 +00:00
set_bit(SDF_NOJOURNALID, &sdp->sd_flags);
gfs2_tune_init(&sdp->sd_tune);
init_waitqueue_head(&sdp->sd_kill_wait);
gfs2: Use async glocks for rename Because s_vfs_rename_mutex is not cluster-wide, multiple nodes can reverse the roles of which directories are "old" and which are "new" for the purposes of rename. This can cause deadlocks where two nodes end up waiting for each other. There can be several layers of directory dependencies across many nodes. This patch fixes the problem by acquiring all gfs2_rename's inode glocks asychronously and waiting for all glocks to be acquired. That way all inodes are locked regardless of the order. The timeout value for multiple asynchronous glocks is calculated to be the total of the individual wait times for each glock times two. Since gfs2_exchange is very similar to gfs2_rename, both functions are patched in the same way. A new async glock wait queue, sd_async_glock_wait, keeps a list of waiters for these events. If gfs2's holder_wake function detects an async holder, it wakes up any waiters for the event. The waiter only tests whether any of its requests are still pending. Since the glocks are sent to dlm asychronously, the wait function needs to check to see which glocks, if any, were granted. If a glock is granted by dlm (and therefore held), its minimum hold time is checked and adjusted as necessary, as other glock grants do. If the event times out, all glocks held thus far must be dequeued to resolve any existing deadlocks. Then, if there are any outstanding locking requests, we need to loop around and wait for dlm to respond to those requests too. After we release all requests, we return -ESTALE to the caller (vfs rename) which loops around and retries the request. Node1 Node2 --------- --------- 1. Enqueue A Enqueue B 2. Enqueue B Enqueue A 3. A granted 6. B granted 7. Wait for B 8. Wait for A 9. A times out (since Node 1 holds A) 10. Dequeue B (since it was granted) 11. Wait for all requests from DLM 12. B Granted (since Node2 released it in step 10) 13. Rename 14. Dequeue A 15. DLM Grants A 16. Dequeue A (due to the timeout and since we no longer have B held for our task). 17. Dequeue B 18. Return -ESTALE to vfs 19. VFS retries the operation, goto step 1. This release-all-locks / acquire-all-locks may slow rename / exchange down as both nodes struggle in the same way and do the same thing. However, this will only happen when there is contention for the same inodes, which ought to be rare. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-08-30 17:31:02 +00:00
init_waitqueue_head(&sdp->sd_async_glock_wait);
atomic_set(&sdp->sd_glock_disposal, 0);
init_completion(&sdp->sd_locking_init);
init_completion(&sdp->sd_wdack);
spin_lock_init(&sdp->sd_statfs_spin);
spin_lock_init(&sdp->sd_rindex_spin);
GFS2: Use rbtree for resource groups and clean up bitmap buffer ref count scheme Here is an update of Bob's original rbtree patch which, in addition, also resolves the rather strange ref counting that was being done relating to the bitmap blocks. Originally we had a dual system for journaling resource groups. The metadata blocks were journaled and also the rgrp itself was added to a list. The reason for adding the rgrp to the list in the journal was so that the "repolish clones" code could be run to update the free space, and potentially send any discard requests when the log was flushed. This was done by comparing the "cloned" bitmap with what had been written back on disk during the transaction commit. Due to this, there was a requirement to hang on to the rgrps' bitmap buffers until the journal had been flushed. For that reason, there was a rather complicated set up in the ->go_lock ->go_unlock functions for rgrps involving both a mutex and a spinlock (the ->sd_rindex_spin) to maintain a reference count on the buffers. However, the journal maintains a reference count on the buffers anyway, since they are being journaled as metadata buffers. So by moving the code which deals with the post-journal accounting for bitmap blocks to the metadata journaling code, we can entirely dispense with the rather strange buffer ref counting scheme and also the requirement to journal the rgrps. The net result of all this is that the ->sd_rindex_spin is left to do exactly one job, and that is to look after the rbtree or rgrps. This patch is designed to be a stepping stone towards using RCU for the rbtree of resource groups, however the reduction in the number of uses of the ->sd_rindex_spin is likely to have benefits for multi-threaded workloads, anyway. The patch retains ->go_lock and ->go_unlock for rgrps, however these maybe also be removed in future in favour of calling the functions directly where required in the code. That will allow locking of resource groups without needing to actually read them in - something that could be useful in speeding up statfs. In the mean time though it is valid to dereference ->bi_bh only when the rgrp is locked. This is basically the same rule as before, modulo the references not being valid until the following journal flush. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com> Cc: Benjamin Marzinski <bmarzins@redhat.com>
2011-08-31 08:53:19 +00:00
sdp->sd_rindex_tree.rb_node = NULL;
INIT_LIST_HEAD(&sdp->sd_jindex_list);
spin_lock_init(&sdp->sd_jindex_spin);
mutex_init(&sdp->sd_jindex_mutex);
init_completion(&sdp->sd_journal_ready);
INIT_LIST_HEAD(&sdp->sd_quota_list);
mutex_init(&sdp->sd_quota_mutex);
mutex_init(&sdp->sd_quota_sync_mutex);
init_waitqueue_head(&sdp->sd_quota_wait);
spin_lock_init(&sdp->sd_bitmap_lock);
INIT_LIST_HEAD(&sdp->sd_sc_inodes_list);
mapping = &sdp->sd_aspace;
address_space_init_once(mapping);
mapping->a_ops = &gfs2_rgrp_aops;
mapping->host = sb->s_bdev->bd_inode;
mapping->flags = 0;
mapping_set_gfp_mask(mapping, GFP_NOFS);
mapping->i_private_data = NULL;
mapping->writeback_index = 0;
spin_lock_init(&sdp->sd_log_lock);
atomic_set(&sdp->sd_log_pinned, 0);
INIT_LIST_HEAD(&sdp->sd_log_revokes);
INIT_LIST_HEAD(&sdp->sd_log_ordered);
spin_lock_init(&sdp->sd_ordered_lock);
init_waitqueue_head(&sdp->sd_log_waitq);
init_waitqueue_head(&sdp->sd_logd_waitq);
spin_lock_init(&sdp->sd_ail_lock);
INIT_LIST_HEAD(&sdp->sd_ail1_list);
INIT_LIST_HEAD(&sdp->sd_ail2_list);
init_rwsem(&sdp->sd_log_flush_lock);
atomic_set(&sdp->sd_log_in_flight, 0);
init_waitqueue_head(&sdp->sd_log_flush_wait);
mutex_init(&sdp->sd_freeze_mutex);
return sdp;
fail:
free_sbd(sdp);
return NULL;
}
/**
* gfs2_check_sb - Check superblock
* @sdp: the filesystem
* @silent: Don't print a message if the check fails
*
* Checks the version code of the FS is one that we understand how to
* read and that the sizes of the various on-disk structures have not
* changed.
*/
static int gfs2_check_sb(struct gfs2_sbd *sdp, int silent)
{
struct gfs2_sb_host *sb = &sdp->sd_sb;
if (sb->sb_magic != GFS2_MAGIC ||
sb->sb_type != GFS2_METATYPE_SB) {
if (!silent)
pr_warn("not a GFS2 filesystem\n");
return -EINVAL;
}
if (sb->sb_fs_format < GFS2_FS_FORMAT_MIN ||
sb->sb_fs_format > GFS2_FS_FORMAT_MAX ||
sb->sb_multihost_format != GFS2_FORMAT_MULTI) {
fs_warn(sdp, "Unknown on-disk format, unable to mount\n");
return -EINVAL;
}
if (sb->sb_bsize < 512 || sb->sb_bsize > PAGE_SIZE ||
(sb->sb_bsize & (sb->sb_bsize - 1))) {
pr_warn("Invalid block size\n");
return -EINVAL;
}
if (sb->sb_bsize_shift != ffs(sb->sb_bsize) - 1) {
pr_warn("Invalid block size shift\n");
return -EINVAL;
}
return 0;
}
static void end_bio_io_page(struct bio *bio)
{
struct page *page = bio->bi_private;
if (!bio->bi_status)
SetPageUptodate(page);
else
pr_warn("error %d reading superblock\n", bio->bi_status);
unlock_page(page);
}
static void gfs2_sb_in(struct gfs2_sbd *sdp, const void *buf)
{
struct gfs2_sb_host *sb = &sdp->sd_sb;
struct super_block *s = sdp->sd_vfs;
const struct gfs2_sb *str = buf;
sb->sb_magic = be32_to_cpu(str->sb_header.mh_magic);
sb->sb_type = be32_to_cpu(str->sb_header.mh_type);
sb->sb_fs_format = be32_to_cpu(str->sb_fs_format);
sb->sb_multihost_format = be32_to_cpu(str->sb_multihost_format);
sb->sb_bsize = be32_to_cpu(str->sb_bsize);
sb->sb_bsize_shift = be32_to_cpu(str->sb_bsize_shift);
sb->sb_master_dir.no_addr = be64_to_cpu(str->sb_master_dir.no_addr);
sb->sb_master_dir.no_formal_ino = be64_to_cpu(str->sb_master_dir.no_formal_ino);
sb->sb_root_dir.no_addr = be64_to_cpu(str->sb_root_dir.no_addr);
sb->sb_root_dir.no_formal_ino = be64_to_cpu(str->sb_root_dir.no_formal_ino);
memcpy(sb->sb_lockproto, str->sb_lockproto, GFS2_LOCKNAME_LEN);
memcpy(sb->sb_locktable, str->sb_locktable, GFS2_LOCKNAME_LEN);
super_set_uuid(s, str->sb_uuid, 16);
}
/**
* gfs2_read_super - Read the gfs2 super block from disk
* @sdp: The GFS2 super block
* @sector: The location of the super block
* @silent: Don't print a message if the check fails
*
* This uses the bio functions to read the super block from disk
* because we want to be 100% sure that we never read cached data.
* A super block is read twice only during each GFS2 mount and is
* never written to by the filesystem. The first time its read no
* locks are held, and the only details which are looked at are those
* relating to the locking protocol. Once locking is up and working,
* the sb is read again under the lock to establish the location of
* the master directory (contains pointers to journals etc) and the
* root directory.
*
* Returns: 0 on success or error
*/
static int gfs2_read_super(struct gfs2_sbd *sdp, sector_t sector, int silent)
{
struct super_block *sb = sdp->sd_vfs;
struct gfs2_sb *p;
struct page *page;
struct bio *bio;
page = alloc_page(GFP_NOFS);
if (unlikely(!page))
return -ENOMEM;
ClearPageUptodate(page);
ClearPageDirty(page);
lock_page(page);
bio = bio_alloc(sb->s_bdev, 1, REQ_OP_READ | REQ_META, GFP_NOFS);
block: Abstract out bvec iterator Immutable biovecs are going to require an explicit iterator. To implement immutable bvecs, a later patch is going to add a bi_bvec_done member to this struct; for now, this patch effectively just renames things. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: "Ed L. Cashin" <ecashin@coraid.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Lars Ellenberg <drbd-dev@lists.linbit.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Yehuda Sadeh <yehuda@inktank.com> Cc: Sage Weil <sage@inktank.com> Cc: Alex Elder <elder@inktank.com> Cc: ceph-devel@vger.kernel.org Cc: Joshua Morris <josh.h.morris@us.ibm.com> Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Neil Brown <neilb@suse.de> Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: dm-devel@redhat.com Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: linux390@de.ibm.com Cc: Boaz Harrosh <bharrosh@panasas.com> Cc: Benny Halevy <bhalevy@tonian.com> Cc: "James E.J. Bottomley" <JBottomley@parallels.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Nicholas A. Bellinger" <nab@linux-iscsi.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Chris Mason <chris.mason@fusionio.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Dave Kleikamp <shaggy@kernel.org> Cc: Joern Engel <joern@logfs.org> Cc: Prasad Joshi <prasadjoshi.linux@gmail.com> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Ben Myers <bpm@sgi.com> Cc: xfs@oss.sgi.com Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Len Brown <len.brown@intel.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Guo Chao <yan@linux.vnet.ibm.com> Cc: Tejun Heo <tj@kernel.org> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Cc: "Roger Pau Monné" <roger.pau@citrix.com> Cc: Jan Beulich <jbeulich@suse.com> Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Cc: Ian Campbell <Ian.Campbell@citrix.com> Cc: Sebastian Ott <sebott@linux.vnet.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Jerome Marchand <jmarchand@redhat.com> Cc: Joe Perches <joe@perches.com> Cc: Peng Tao <tao.peng@emc.com> Cc: Andy Adamson <andros@netapp.com> Cc: fanchaoting <fanchaoting@cn.fujitsu.com> Cc: Jie Liu <jeff.liu@oracle.com> Cc: Sunil Mushran <sunil.mushran@gmail.com> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: Namjae Jeon <namjae.jeon@samsung.com> Cc: Pankaj Kumar <pankaj.km@samsung.com> Cc: Dan Magenheimer <dan.magenheimer@oracle.com> Cc: Mel Gorman <mgorman@suse.de>6
2013-10-11 22:44:27 +00:00
bio->bi_iter.bi_sector = sector * (sb->s_blocksize >> 9);
__bio_add_page(bio, page, PAGE_SIZE, 0);
bio->bi_end_io = end_bio_io_page;
bio->bi_private = page;
submit_bio(bio);
wait_on_page_locked(page);
bio_put(bio);
if (!PageUptodate(page)) {
__free_page(page);
return -EIO;
}
p = kmap(page);
gfs2_sb_in(sdp, p);
kunmap(page);
__free_page(page);
return gfs2_check_sb(sdp, silent);
}
/**
* gfs2_read_sb - Read super block
* @sdp: The GFS2 superblock
* @silent: Don't print message if mount fails
*
*/
static int gfs2_read_sb(struct gfs2_sbd *sdp, int silent)
{
u32 hash_blocks, ind_blocks, leaf_blocks;
u32 tmp_blocks;
unsigned int x;
int error;
error = gfs2_read_super(sdp, GFS2_SB_ADDR >> sdp->sd_fsb2bb_shift, silent);
if (error) {
if (!silent)
fs_err(sdp, "can't read superblock\n");
return error;
}
sdp->sd_fsb2bb_shift = sdp->sd_sb.sb_bsize_shift - 9;
sdp->sd_fsb2bb = BIT(sdp->sd_fsb2bb_shift);
sdp->sd_diptrs = (sdp->sd_sb.sb_bsize -
sizeof(struct gfs2_dinode)) / sizeof(u64);
sdp->sd_inptrs = (sdp->sd_sb.sb_bsize -
sizeof(struct gfs2_meta_header)) / sizeof(u64);
sdp->sd_ldptrs = (sdp->sd_sb.sb_bsize -
sizeof(struct gfs2_log_descriptor)) / sizeof(u64);
sdp->sd_jbsize = sdp->sd_sb.sb_bsize - sizeof(struct gfs2_meta_header);
sdp->sd_hash_bsize = sdp->sd_sb.sb_bsize / 2;
sdp->sd_hash_bsize_shift = sdp->sd_sb.sb_bsize_shift - 1;
sdp->sd_hash_ptrs = sdp->sd_hash_bsize / sizeof(u64);
sdp->sd_qc_per_block = (sdp->sd_sb.sb_bsize -
sizeof(struct gfs2_meta_header)) /
sizeof(struct gfs2_quota_change);
GFS2: Speed up gfs2_rbm_from_block This patch is a rewrite of function gfs2_rbm_from_block. Rather than looping to find the right bitmap, the code now does a few simple math calculations. I compared the performance of both algorithms side by side and the new algorithm is noticeably faster. Sample instrumentation output from a "fast" machine: 5 million calls: millisec spent: Orig: 166 New: 113 5 million calls: millisec spent: Orig: 189 New: 114 In addition, I ran postmark (on a somewhat slowr CPU) before the after the new algorithm was put in place and postmark showed a decent improvement: Before the new algorithm: ------------------------- Time: 645 seconds total 584 seconds of transactions (171 per second) Files: 150087 created (232 per second) Creation alone: 100000 files (2083 per second) Mixed with transactions: 50087 files (85 per second) 49995 read (85 per second) 49991 appended (85 per second) 150087 deleted (232 per second) Deletion alone: 100174 files (7705 per second) Mixed with transactions: 49913 files (85 per second) Data: 273.42 megabytes read (434.08 kilobytes per second) 852.13 megabytes written (1.32 megabytes per second) With the new algorithm: ----------------------- Time: 599 seconds total 530 seconds of transactions (188 per second) Files: 150087 created (250 per second) Creation alone: 100000 files (1886 per second) Mixed with transactions: 50087 files (94 per second) 49995 read (94 per second) 49991 appended (94 per second) 150087 deleted (250 per second) Deletion alone: 100174 files (6260 per second) Mixed with transactions: 49913 files (94 per second) Data: 273.42 megabytes read (467.42 kilobytes per second) 852.13 megabytes written (1.42 megabytes per second) Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2012-10-19 12:32:51 +00:00
sdp->sd_blocks_per_bitmap = (sdp->sd_sb.sb_bsize -
sizeof(struct gfs2_meta_header))
* GFS2_NBBY; /* not the rgrp bitmap, subsequent bitmaps only */
/*
* We always keep at least one block reserved for revokes in
* transactions. This greatly simplifies allocating additional
* revoke blocks.
*/
atomic_set(&sdp->sd_log_revokes_available, sdp->sd_ldptrs);
/* Compute maximum reservation required to add a entry to a directory */
hash_blocks = DIV_ROUND_UP(sizeof(u64) * BIT(GFS2_DIR_MAX_DEPTH),
sdp->sd_jbsize);
ind_blocks = 0;
for (tmp_blocks = hash_blocks; tmp_blocks > sdp->sd_diptrs;) {
tmp_blocks = DIV_ROUND_UP(tmp_blocks, sdp->sd_inptrs);
ind_blocks += tmp_blocks;
}
leaf_blocks = 2 + GFS2_DIR_MAX_DEPTH;
sdp->sd_max_dirres = hash_blocks + ind_blocks + leaf_blocks;
sdp->sd_heightsize[0] = sdp->sd_sb.sb_bsize -
sizeof(struct gfs2_dinode);
sdp->sd_heightsize[1] = sdp->sd_sb.sb_bsize * sdp->sd_diptrs;
for (x = 2;; x++) {
u64 space, d;
u32 m;
space = sdp->sd_heightsize[x - 1] * sdp->sd_inptrs;
d = space;
m = do_div(d, sdp->sd_inptrs);
if (d != sdp->sd_heightsize[x - 1] || m)
break;
sdp->sd_heightsize[x] = space;
}
sdp->sd_max_height = x;
sdp->sd_heightsize[x] = ~0;
gfs2_assert(sdp, sdp->sd_max_height <= GFS2_MAX_META_HEIGHT);
gfs2: change gfs2 readdir cookie gfs2 currently returns 31 bits of filename hash as a cookie that readdir uses for an offset into the directory. When there are a large number of directory entries, the likelihood of a collision goes up way too quickly. GFS2 will now return cookies that are guaranteed unique for a while, and then fail back to using 30 bits of filename hash. Specifically, the directory leaf blocks are divided up into chunks based on the minimum size of a gfs2 directory entry (48 bytes). Each entry's cookie is based off the chunk where it starts, in the linked list of leaf blocks that it hashes to (there are 131072 hash buckets). Directory entries will have unique names until they take reach chunk 8192. Assuming the largest filenames possible, and the least efficient spacing possible, this new method will still be able to return unique names when the previous method has statistically more than a 99% chance of a collision. The non-unique names it fails back to are guaranteed to not collide with the unique names. unique cookies will be in this format: - 1 bit "0" to make sure the the returned cookie is positive - 17 bits for the hash table index - 1 bit for the mode "0" - 13 bits for the offset non-unique cookies will be in this format: - 1 bit "0" to make sure the the returned cookie is positive - 17 bits for the hash table index - 1 bit for the mode "1" - 13 more bits of the name hash Another benefit of location based cookies, is that once a directory's exhash table is fully extended (so that multiple hash table indexs do not use the same leaf blocks), gfs2 can skip sorting the directory entries until it reaches the non-unique ones, and then it only needs to sort these. This provides a significant speed up for directory reads of very large directories. The only issue is that for these cookies to continue to point to the correct entry as files are added and removed from the directory, gfs2 must keep the entries at the same offset in the leaf block when they are split (see my previous patch). This means that until all the nodes in a cluster are running with code that will split the directory leaf blocks this way, none of the nodes can use the new cookie code. To deal with this, gfs2 now has the mount option loccookie, which, if set, will make it return these new location based cookies. This option must not be set until all nodes in the cluster are at least running this version of the kernel code, and you have guaranteed that there are no outstanding cookies required by other software, such as NFS. gfs2 uses some of the extra space at the end of the gfs2_dirent structure to store the calculated readdir cookies. This keeps us from needing to allocate a seperate array to hold these values. gfs2 recomputes the cookie stored in de_cookie for every readdir call. The time it takes to do so is small, and if gfs2 expected this value to be saved on disk, the new code wouldn't work correctly on filesystems created with an earlier version of gfs2. One issue with adding de_cookie to the union in the gfs2_dirent structure is that it caused the union to align itself to a 4 byte boundary, instead of its previous 2 byte boundary. This changed the offset of de_rahead. To solve that, I pulled de_rahead out of the union, since it does not need to be there. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2015-12-01 14:46:55 +00:00
sdp->sd_max_dents_per_leaf = (sdp->sd_sb.sb_bsize -
sizeof(struct gfs2_leaf)) /
GFS2_MIN_DIRENT_SIZE;
return 0;
}
static int init_names(struct gfs2_sbd *sdp, int silent)
{
char *proto, *table;
int error = 0;
proto = sdp->sd_args.ar_lockproto;
table = sdp->sd_args.ar_locktable;
/* Try to autodetect */
if (!proto[0] || !table[0]) {
error = gfs2_read_super(sdp, GFS2_SB_ADDR >> sdp->sd_fsb2bb_shift, silent);
if (error)
return error;
if (!proto[0])
proto = sdp->sd_sb.sb_lockproto;
if (!table[0])
table = sdp->sd_sb.sb_locktable;
}
if (!table[0])
table = sdp->sd_vfs->s_id;
BUILD_BUG_ON(GFS2_LOCKNAME_LEN > GFS2_FSNAME_LEN);
strscpy(sdp->sd_proto_name, proto, GFS2_LOCKNAME_LEN);
strscpy(sdp->sd_table_name, table, GFS2_LOCKNAME_LEN);
table = sdp->sd_table_name;
while ((table = strchr(table, '/')))
*table = '_';
return error;
}
static int init_locking(struct gfs2_sbd *sdp, struct gfs2_holder *mount_gh,
int undo)
{
int error = 0;
if (undo)
goto fail_trans;
error = gfs2_glock_nq_num(sdp,
GFS2_MOUNT_LOCK, &gfs2_nondisk_glops,
LM_ST_EXCLUSIVE,
LM_FLAG_NOEXP | GL_NOCACHE | GL_NOPID,
mount_gh);
if (error) {
fs_err(sdp, "can't acquire mount glock: %d\n", error);
goto fail;
}
error = gfs2_glock_nq_num(sdp,
GFS2_LIVE_LOCK, &gfs2_nondisk_glops,
LM_ST_SHARED,
LM_FLAG_NOEXP | GL_EXACT | GL_NOPID,
&sdp->sd_live_gh);
if (error) {
fs_err(sdp, "can't acquire live glock: %d\n", error);
goto fail_mount;
}
error = gfs2_glock_get(sdp, GFS2_RENAME_LOCK, &gfs2_nondisk_glops,
CREATE, &sdp->sd_rename_gl);
if (error) {
fs_err(sdp, "can't create rename glock: %d\n", error);
goto fail_live;
}
GFS2: remove transaction glock GFS2 has a transaction glock, which must be grabbed for every transaction, whose purpose is to deal with freezing the filesystem. Aside from this involving a large amount of locking, it is very easy to make the current fsfreeze code hang on unfreezing. This patch rewrites how gfs2 handles freezing the filesystem. The transaction glock is removed. In it's place is a freeze glock, which is cached (but not held) in a shared state by every node in the cluster when the filesystem is mounted. This lock only needs to be grabbed on freezing, and actions which need to be safe from freezing, like recovery. When a node wants to freeze the filesystem, it grabs this glock exclusively. When the freeze glock state changes on the nodes (either from shared to unlocked, or shared to exclusive), the filesystem does a special log flush. gfs2_log_flush() does all the work for flushing out the and shutting down the incore log, and then it tries to grab the freeze glock in a shared state again. Since the filesystem is stuck in gfs2_log_flush, no new transaction can start, and nothing can be written to disk. Unfreezing the filesytem simply involes dropping the freeze glock, allowing gfs2_log_flush() to grab and then release the shared lock, so it is cached for next time. However, in order for the unfreezing ioctl to occur, gfs2 needs to get a shared lock on the filesystem root directory inode to check permissions. If that glock has already been grabbed exclusively, fsfreeze will be unable to get the shared lock and unfreeze the filesystem. In order to allow the unfreeze, this patch makes gfs2 grab a shared lock on the filesystem root directory during the freeze, and hold it until it unfreezes the filesystem. The functions which need to grab a shared lock in order to allow the unfreeze ioctl to be issued now use the lock grabbed by the freeze code instead. The freeze and unfreeze code take care to make sure that this shared lock will not be dropped while another process is using it. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-05-02 03:26:55 +00:00
error = gfs2_glock_get(sdp, GFS2_FREEZE_LOCK, &gfs2_freeze_glops,
CREATE, &sdp->sd_freeze_gl);
if (error) {
fs_err(sdp, "can't create freeze glock: %d\n", error);
goto fail_rename;
}
return 0;
fail_trans:
GFS2: remove transaction glock GFS2 has a transaction glock, which must be grabbed for every transaction, whose purpose is to deal with freezing the filesystem. Aside from this involving a large amount of locking, it is very easy to make the current fsfreeze code hang on unfreezing. This patch rewrites how gfs2 handles freezing the filesystem. The transaction glock is removed. In it's place is a freeze glock, which is cached (but not held) in a shared state by every node in the cluster when the filesystem is mounted. This lock only needs to be grabbed on freezing, and actions which need to be safe from freezing, like recovery. When a node wants to freeze the filesystem, it grabs this glock exclusively. When the freeze glock state changes on the nodes (either from shared to unlocked, or shared to exclusive), the filesystem does a special log flush. gfs2_log_flush() does all the work for flushing out the and shutting down the incore log, and then it tries to grab the freeze glock in a shared state again. Since the filesystem is stuck in gfs2_log_flush, no new transaction can start, and nothing can be written to disk. Unfreezing the filesytem simply involes dropping the freeze glock, allowing gfs2_log_flush() to grab and then release the shared lock, so it is cached for next time. However, in order for the unfreezing ioctl to occur, gfs2 needs to get a shared lock on the filesystem root directory inode to check permissions. If that glock has already been grabbed exclusively, fsfreeze will be unable to get the shared lock and unfreeze the filesystem. In order to allow the unfreeze, this patch makes gfs2 grab a shared lock on the filesystem root directory during the freeze, and hold it until it unfreezes the filesystem. The functions which need to grab a shared lock in order to allow the unfreeze ioctl to be issued now use the lock grabbed by the freeze code instead. The freeze and unfreeze code take care to make sure that this shared lock will not be dropped while another process is using it. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-05-02 03:26:55 +00:00
gfs2_glock_put(sdp->sd_freeze_gl);
fail_rename:
gfs2_glock_put(sdp->sd_rename_gl);
fail_live:
gfs2_glock_dq_uninit(&sdp->sd_live_gh);
fail_mount:
gfs2_glock_dq_uninit(mount_gh);
fail:
return error;
}
static int gfs2_lookup_root(struct super_block *sb, struct dentry **dptr,
u64 no_addr, const char *name)
{
struct gfs2_sbd *sdp = sb->s_fs_info;
struct dentry *dentry;
struct inode *inode;
inode = gfs2_inode_lookup(sb, DT_DIR, no_addr, 0,
GFS2_BLKST_FREE /* ignore */);
if (IS_ERR(inode)) {
fs_err(sdp, "can't read in %s inode: %ld\n", name, PTR_ERR(inode));
return PTR_ERR(inode);
}
dentry = d_make_root(inode);
if (!dentry) {
fs_err(sdp, "can't alloc %s dentry\n", name);
return -ENOMEM;
}
*dptr = dentry;
return 0;
}
static int init_sb(struct gfs2_sbd *sdp, int silent)
{
struct super_block *sb = sdp->sd_vfs;
struct gfs2_holder sb_gh;
u64 no_addr;
int ret;
ret = gfs2_glock_nq_num(sdp, GFS2_SB_LOCK, &gfs2_meta_glops,
LM_ST_SHARED, 0, &sb_gh);
if (ret) {
fs_err(sdp, "can't acquire superblock glock: %d\n", ret);
return ret;
}
ret = gfs2_read_sb(sdp, silent);
if (ret) {
fs_err(sdp, "can't read superblock: %d\n", ret);
goto out;
}
switch(sdp->sd_sb.sb_fs_format) {
case GFS2_FS_FORMAT_MAX:
sb->s_xattr = gfs2_xattr_handlers_max;
break;
case GFS2_FS_FORMAT_MIN:
sb->s_xattr = gfs2_xattr_handlers_min;
break;
default:
BUG();
}
/* Set up the buffer cache and SB for real */
if (sdp->sd_sb.sb_bsize < bdev_logical_block_size(sb->s_bdev)) {
ret = -EINVAL;
fs_err(sdp, "FS block size (%u) is too small for device "
"block size (%u)\n",
sdp->sd_sb.sb_bsize, bdev_logical_block_size(sb->s_bdev));
goto out;
}
if (sdp->sd_sb.sb_bsize > PAGE_SIZE) {
ret = -EINVAL;
fs_err(sdp, "FS block size (%u) is too big for machine "
"page size (%u)\n",
sdp->sd_sb.sb_bsize, (unsigned int)PAGE_SIZE);
goto out;
}
sb_set_blocksize(sb, sdp->sd_sb.sb_bsize);
/* Get the root inode */
no_addr = sdp->sd_sb.sb_root_dir.no_addr;
ret = gfs2_lookup_root(sb, &sdp->sd_root_dir, no_addr, "root");
if (ret)
goto out;
/* Get the master inode */
no_addr = sdp->sd_sb.sb_master_dir.no_addr;
ret = gfs2_lookup_root(sb, &sdp->sd_master_dir, no_addr, "master");
if (ret) {
dput(sdp->sd_root_dir);
goto out;
}
sb->s_root = dget(sdp->sd_args.ar_meta ? sdp->sd_master_dir : sdp->sd_root_dir);
out:
gfs2_glock_dq_uninit(&sb_gh);
return ret;
}
static void gfs2_others_may_mount(struct gfs2_sbd *sdp)
{
char *message = "FIRSTMOUNT=Done";
char *envp[] = { message, NULL };
fs_info(sdp, "first mount done, others may mount\n");
if (sdp->sd_lockstruct.ls_ops->lm_first_done)
sdp->sd_lockstruct.ls_ops->lm_first_done(sdp);
kobject_uevent_env(&sdp->sd_kobj, KOBJ_CHANGE, envp);
}
/**
* gfs2_jindex_hold - Grab a lock on the jindex
* @sdp: The GFS2 superblock
* @ji_gh: the holder for the jindex glock
*
* Returns: errno
*/
static int gfs2_jindex_hold(struct gfs2_sbd *sdp, struct gfs2_holder *ji_gh)
{
struct gfs2_inode *dip = GFS2_I(sdp->sd_jindex);
struct qstr name;
char buf[20];
struct gfs2_jdesc *jd;
int error;
name.name = buf;
mutex_lock(&sdp->sd_jindex_mutex);
for (;;) {
struct gfs2_inode *jip;
error = gfs2_glock_nq_init(dip->i_gl, LM_ST_SHARED, 0, ji_gh);
if (error)
break;
name.len = sprintf(buf, "journal%u", sdp->sd_journals);
name.hash = gfs2_disk_hash(name.name, name.len);
error = gfs2_dir_check(sdp->sd_jindex, &name, NULL);
if (error == -ENOENT) {
error = 0;
break;
}
gfs2_glock_dq_uninit(ji_gh);
if (error)
break;
error = -ENOMEM;
jd = kzalloc(sizeof(struct gfs2_jdesc), GFP_KERNEL);
if (!jd)
break;
INIT_LIST_HEAD(&jd->extent_list);
INIT_LIST_HEAD(&jd->jd_revoke_list);
INIT_WORK(&jd->jd_work, gfs2_recover_func);
jd->jd_inode = gfs2_lookupi(sdp->sd_jindex, &name, 1);
if (IS_ERR_OR_NULL(jd->jd_inode)) {
if (!jd->jd_inode)
error = -ENOENT;
else
error = PTR_ERR(jd->jd_inode);
kfree(jd);
break;
}
d_mark_dontcache(jd->jd_inode);
spin_lock(&sdp->sd_jindex_spin);
jd->jd_jid = sdp->sd_journals++;
jip = GFS2_I(jd->jd_inode);
jd->jd_no_addr = jip->i_no_addr;
list_add_tail(&jd->jd_list, &sdp->sd_jindex_list);
spin_unlock(&sdp->sd_jindex_spin);
}
mutex_unlock(&sdp->sd_jindex_mutex);
return error;
}
/**
* init_statfs - look up and initialize master and local (per node) statfs inodes
* @sdp: The GFS2 superblock
*
* This should be called after the jindex is initialized in init_journal() and
* before gfs2_journal_recovery() is called because we need to be able to write
* to these inodes during recovery.
*
* Returns: errno
*/
static int init_statfs(struct gfs2_sbd *sdp)
{
int error = 0;
struct inode *master = d_inode(sdp->sd_master_dir);
struct inode *pn = NULL;
char buf[30];
struct gfs2_jdesc *jd;
struct gfs2_inode *ip;
sdp->sd_statfs_inode = gfs2_lookup_meta(master, "statfs");
if (IS_ERR(sdp->sd_statfs_inode)) {
error = PTR_ERR(sdp->sd_statfs_inode);
fs_err(sdp, "can't read in statfs inode: %d\n", error);
goto out;
}
if (sdp->sd_args.ar_spectator)
goto out;
pn = gfs2_lookup_meta(master, "per_node");
if (IS_ERR(pn)) {
error = PTR_ERR(pn);
fs_err(sdp, "can't find per_node directory: %d\n", error);
goto put_statfs;
}
/* For each jid, lookup the corresponding local statfs inode in the
* per_node metafs directory and save it in the sdp->sd_sc_inodes_list. */
list_for_each_entry(jd, &sdp->sd_jindex_list, jd_list) {
struct local_statfs_inode *lsi =
kmalloc(sizeof(struct local_statfs_inode), GFP_NOFS);
if (!lsi) {
error = -ENOMEM;
goto free_local;
}
sprintf(buf, "statfs_change%u", jd->jd_jid);
lsi->si_sc_inode = gfs2_lookup_meta(pn, buf);
if (IS_ERR(lsi->si_sc_inode)) {
error = PTR_ERR(lsi->si_sc_inode);
fs_err(sdp, "can't find local \"sc\" file#%u: %d\n",
jd->jd_jid, error);
kfree(lsi);
goto free_local;
}
lsi->si_jid = jd->jd_jid;
if (jd->jd_jid == sdp->sd_jdesc->jd_jid)
sdp->sd_sc_inode = lsi->si_sc_inode;
list_add_tail(&lsi->si_list, &sdp->sd_sc_inodes_list);
}
iput(pn);
pn = NULL;
ip = GFS2_I(sdp->sd_sc_inode);
error = gfs2_glock_nq_init(ip->i_gl, LM_ST_EXCLUSIVE, GL_NOPID,
&sdp->sd_sc_gh);
if (error) {
fs_err(sdp, "can't lock local \"sc\" file: %d\n", error);
goto free_local;
}
/* read in the local statfs buffer - other nodes don't change it. */
error = gfs2_meta_inode_buffer(ip, &sdp->sd_sc_bh);
if (error) {
fs_err(sdp, "Cannot read in local statfs: %d\n", error);
goto unlock_sd_gh;
}
return 0;
unlock_sd_gh:
gfs2_glock_dq_uninit(&sdp->sd_sc_gh);
free_local:
free_local_statfs_inodes(sdp);
iput(pn);
put_statfs:
iput(sdp->sd_statfs_inode);
out:
return error;
}
/* Uninitialize and free up memory used by the list of statfs inodes */
static void uninit_statfs(struct gfs2_sbd *sdp)
{
if (!sdp->sd_args.ar_spectator) {
brelse(sdp->sd_sc_bh);
gfs2_glock_dq_uninit(&sdp->sd_sc_gh);
free_local_statfs_inodes(sdp);
}
iput(sdp->sd_statfs_inode);
}
static int init_journal(struct gfs2_sbd *sdp, int undo)
{
struct inode *master = d_inode(sdp->sd_master_dir);
struct gfs2_holder ji_gh;
struct gfs2_inode *ip;
int error = 0;
gfs2_holder_mark_uninitialized(&ji_gh);
if (undo)
goto fail_statfs;
sdp->sd_jindex = gfs2_lookup_meta(master, "jindex");
if (IS_ERR(sdp->sd_jindex)) {
fs_err(sdp, "can't lookup journal index: %d\n", error);
return PTR_ERR(sdp->sd_jindex);
}
/* Load in the journal index special file */
error = gfs2_jindex_hold(sdp, &ji_gh);
if (error) {
fs_err(sdp, "can't read journal index: %d\n", error);
goto fail;
}
error = -EUSERS;
if (!gfs2_jindex_size(sdp)) {
fs_err(sdp, "no journals!\n");
goto fail_jindex;
}
atomic_set(&sdp->sd_log_blks_needed, 0);
if (sdp->sd_args.ar_spectator) {
sdp->sd_jdesc = gfs2_jdesc_find(sdp, 0);
atomic_set(&sdp->sd_log_blks_free, sdp->sd_jdesc->jd_blocks);
atomic_set(&sdp->sd_log_thresh1, 2*sdp->sd_jdesc->jd_blocks/5);
atomic_set(&sdp->sd_log_thresh2, 4*sdp->sd_jdesc->jd_blocks/5);
} else {
if (sdp->sd_lockstruct.ls_jid >= gfs2_jindex_size(sdp)) {
fs_err(sdp, "can't mount journal #%u\n",
sdp->sd_lockstruct.ls_jid);
fs_err(sdp, "there are only %u journals (0 - %u)\n",
gfs2_jindex_size(sdp),
gfs2_jindex_size(sdp) - 1);
goto fail_jindex;
}
sdp->sd_jdesc = gfs2_jdesc_find(sdp, sdp->sd_lockstruct.ls_jid);
error = gfs2_glock_nq_num(sdp, sdp->sd_lockstruct.ls_jid,
&gfs2_journal_glops,
gfs2: Force withdraw to replay journals and wait for it to finish When a node withdraws from a file system, it often leaves its journal in an incomplete state. This is especially true when the withdraw is caused by io errors writing to the journal. Before this patch, a withdraw would try to write a "shutdown" record to the journal, tell dlm it's done with the file system, and none of the other nodes know about the problem. Later, when the problem is fixed and the withdrawn node is rebooted, it would then discover that its own journal was incomplete, and replay it. However, replaying it at this point is almost guaranteed to introduce corruption because the other nodes are likely to have used affected resource groups that appeared in the journal since the time of the withdraw. Replaying the journal later will overwrite any changes made, and not through any fault of dlm, which was instructed during the withdraw to release those resources. This patch makes file system withdraws seen by the entire cluster. Withdrawing nodes dequeue their journal glock to allow recovery. The remaining nodes check all the journals to see if they are clean or in need of replay. They try to replay dirty journals, but only the journals of withdrawn nodes will be "not busy" and therefore available for replay. Until the journal replay is complete, no i/o related glocks may be given out, to ensure that the replay does not cause the aforementioned corruption: We cannot allow any journal replay to overwrite blocks associated with a glock once it is held. The "live" glock which is now used to signal when a withdraw occurs. When a withdraw occurs, the node signals its withdraw by dequeueing the "live" glock and trying to enqueue it in EX mode, thus forcing the other nodes to all see a demote request, by way of a "1CB" (one callback) try lock. The "live" glock is not granted in EX; the callback is only just used to indicate a withdraw has occurred. Note that all nodes in the cluster must wait for the recovering node to finish replaying the withdrawing node's journal before continuing. To this end, it checks that the journals are clean multiple times in a retry loop. Also note that the withdraw function may be called from a wide variety of situations, and therefore, we need to take extra precautions to make sure pointers are valid before using them in many circumstances. We also need to take care when glocks decide to withdraw, since the withdraw code now uses glocks. Also, before this patch, if a process encountered an error and decided to withdraw, if another process was already withdrawing, the second withdraw would be silently ignored, which set it free to unlock its glocks. That's correct behavior if the original withdrawer encounters further errors down the road. But if secondary waiters don't wait for the journal replay, unlocking glocks will allow other nodes to use them, despite the fact that the journal containing those blocks is being replayed. The replay needs to finish before our glocks are released to other nodes. IOW, secondary withdraws need to wait for the first withdraw to finish. For example, if an rgrp glock is unlocked by a process that didn't wait for the first withdraw, a journal replay could introduce file system corruption by replaying a rgrp block that has already been granted to a different cluster node. Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2020-01-28 19:23:45 +00:00
LM_ST_EXCLUSIVE,
LM_FLAG_NOEXP | GL_NOCACHE | GL_NOPID,
&sdp->sd_journal_gh);
if (error) {
fs_err(sdp, "can't acquire journal glock: %d\n", error);
goto fail_jindex;
}
ip = GFS2_I(sdp->sd_jdesc->jd_inode);
gfs2: Force withdraw to replay journals and wait for it to finish When a node withdraws from a file system, it often leaves its journal in an incomplete state. This is especially true when the withdraw is caused by io errors writing to the journal. Before this patch, a withdraw would try to write a "shutdown" record to the journal, tell dlm it's done with the file system, and none of the other nodes know about the problem. Later, when the problem is fixed and the withdrawn node is rebooted, it would then discover that its own journal was incomplete, and replay it. However, replaying it at this point is almost guaranteed to introduce corruption because the other nodes are likely to have used affected resource groups that appeared in the journal since the time of the withdraw. Replaying the journal later will overwrite any changes made, and not through any fault of dlm, which was instructed during the withdraw to release those resources. This patch makes file system withdraws seen by the entire cluster. Withdrawing nodes dequeue their journal glock to allow recovery. The remaining nodes check all the journals to see if they are clean or in need of replay. They try to replay dirty journals, but only the journals of withdrawn nodes will be "not busy" and therefore available for replay. Until the journal replay is complete, no i/o related glocks may be given out, to ensure that the replay does not cause the aforementioned corruption: We cannot allow any journal replay to overwrite blocks associated with a glock once it is held. The "live" glock which is now used to signal when a withdraw occurs. When a withdraw occurs, the node signals its withdraw by dequeueing the "live" glock and trying to enqueue it in EX mode, thus forcing the other nodes to all see a demote request, by way of a "1CB" (one callback) try lock. The "live" glock is not granted in EX; the callback is only just used to indicate a withdraw has occurred. Note that all nodes in the cluster must wait for the recovering node to finish replaying the withdrawing node's journal before continuing. To this end, it checks that the journals are clean multiple times in a retry loop. Also note that the withdraw function may be called from a wide variety of situations, and therefore, we need to take extra precautions to make sure pointers are valid before using them in many circumstances. We also need to take care when glocks decide to withdraw, since the withdraw code now uses glocks. Also, before this patch, if a process encountered an error and decided to withdraw, if another process was already withdrawing, the second withdraw would be silently ignored, which set it free to unlock its glocks. That's correct behavior if the original withdrawer encounters further errors down the road. But if secondary waiters don't wait for the journal replay, unlocking glocks will allow other nodes to use them, despite the fact that the journal containing those blocks is being replayed. The replay needs to finish before our glocks are released to other nodes. IOW, secondary withdraws need to wait for the first withdraw to finish. For example, if an rgrp glock is unlocked by a process that didn't wait for the first withdraw, a journal replay could introduce file system corruption by replaying a rgrp block that has already been granted to a different cluster node. Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2020-01-28 19:23:45 +00:00
sdp->sd_jinode_gl = ip->i_gl;
error = gfs2_glock_nq_init(ip->i_gl, LM_ST_SHARED,
LM_FLAG_NOEXP | GL_EXACT |
GL_NOCACHE | GL_NOPID,
&sdp->sd_jinode_gh);
if (error) {
fs_err(sdp, "can't acquire journal inode glock: %d\n",
error);
goto fail_journal_gh;
}
error = gfs2_jdesc_check(sdp->sd_jdesc);
if (error) {
fs_err(sdp, "my journal (%u) is bad: %d\n",
sdp->sd_jdesc->jd_jid, error);
goto fail_jinode_gh;
}
atomic_set(&sdp->sd_log_blks_free, sdp->sd_jdesc->jd_blocks);
atomic_set(&sdp->sd_log_thresh1, 2*sdp->sd_jdesc->jd_blocks/5);
atomic_set(&sdp->sd_log_thresh2, 4*sdp->sd_jdesc->jd_blocks/5);
/* Map the extents for this journal's blocks */
gfs2_map_journal_extents(sdp, sdp->sd_jdesc);
}
trace_gfs2_log_blocks(sdp, atomic_read(&sdp->sd_log_blks_free));
/* Lookup statfs inodes here so journal recovery can use them. */
error = init_statfs(sdp);
if (error)
goto fail_jinode_gh;
if (sdp->sd_lockstruct.ls_first) {
unsigned int x;
for (x = 0; x < sdp->sd_journals; x++) {
struct gfs2_jdesc *jd = gfs2_jdesc_find(sdp, x);
if (sdp->sd_args.ar_spectator) {
error = check_journal_clean(sdp, jd, true);
if (error)
goto fail_statfs;
continue;
}
error = gfs2_recover_journal(jd, true);
if (error) {
fs_err(sdp, "error recovering journal %u: %d\n",
x, error);
goto fail_statfs;
}
}
gfs2_others_may_mount(sdp);
} else if (!sdp->sd_args.ar_spectator) {
error = gfs2_recover_journal(sdp->sd_jdesc, true);
if (error) {
fs_err(sdp, "error recovering my journal: %d\n", error);
goto fail_statfs;
}
}
sdp->sd_log_idle = 1;
set_bit(SDF_JOURNAL_CHECKED, &sdp->sd_flags);
gfs2_glock_dq_uninit(&ji_gh);
INIT_WORK(&sdp->sd_freeze_work, gfs2_freeze_func);
return 0;
fail_statfs:
uninit_statfs(sdp);
fail_jinode_gh:
gfs2: Force withdraw to replay journals and wait for it to finish When a node withdraws from a file system, it often leaves its journal in an incomplete state. This is especially true when the withdraw is caused by io errors writing to the journal. Before this patch, a withdraw would try to write a "shutdown" record to the journal, tell dlm it's done with the file system, and none of the other nodes know about the problem. Later, when the problem is fixed and the withdrawn node is rebooted, it would then discover that its own journal was incomplete, and replay it. However, replaying it at this point is almost guaranteed to introduce corruption because the other nodes are likely to have used affected resource groups that appeared in the journal since the time of the withdraw. Replaying the journal later will overwrite any changes made, and not through any fault of dlm, which was instructed during the withdraw to release those resources. This patch makes file system withdraws seen by the entire cluster. Withdrawing nodes dequeue their journal glock to allow recovery. The remaining nodes check all the journals to see if they are clean or in need of replay. They try to replay dirty journals, but only the journals of withdrawn nodes will be "not busy" and therefore available for replay. Until the journal replay is complete, no i/o related glocks may be given out, to ensure that the replay does not cause the aforementioned corruption: We cannot allow any journal replay to overwrite blocks associated with a glock once it is held. The "live" glock which is now used to signal when a withdraw occurs. When a withdraw occurs, the node signals its withdraw by dequeueing the "live" glock and trying to enqueue it in EX mode, thus forcing the other nodes to all see a demote request, by way of a "1CB" (one callback) try lock. The "live" glock is not granted in EX; the callback is only just used to indicate a withdraw has occurred. Note that all nodes in the cluster must wait for the recovering node to finish replaying the withdrawing node's journal before continuing. To this end, it checks that the journals are clean multiple times in a retry loop. Also note that the withdraw function may be called from a wide variety of situations, and therefore, we need to take extra precautions to make sure pointers are valid before using them in many circumstances. We also need to take care when glocks decide to withdraw, since the withdraw code now uses glocks. Also, before this patch, if a process encountered an error and decided to withdraw, if another process was already withdrawing, the second withdraw would be silently ignored, which set it free to unlock its glocks. That's correct behavior if the original withdrawer encounters further errors down the road. But if secondary waiters don't wait for the journal replay, unlocking glocks will allow other nodes to use them, despite the fact that the journal containing those blocks is being replayed. The replay needs to finish before our glocks are released to other nodes. IOW, secondary withdraws need to wait for the first withdraw to finish. For example, if an rgrp glock is unlocked by a process that didn't wait for the first withdraw, a journal replay could introduce file system corruption by replaying a rgrp block that has already been granted to a different cluster node. Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2020-01-28 19:23:45 +00:00
/* A withdraw may have done dq/uninit so now we need to check it */
if (!sdp->sd_args.ar_spectator &&
gfs2_holder_initialized(&sdp->sd_jinode_gh))
gfs2_glock_dq_uninit(&sdp->sd_jinode_gh);
fail_journal_gh:
gfs2: Force withdraw to replay journals and wait for it to finish When a node withdraws from a file system, it often leaves its journal in an incomplete state. This is especially true when the withdraw is caused by io errors writing to the journal. Before this patch, a withdraw would try to write a "shutdown" record to the journal, tell dlm it's done with the file system, and none of the other nodes know about the problem. Later, when the problem is fixed and the withdrawn node is rebooted, it would then discover that its own journal was incomplete, and replay it. However, replaying it at this point is almost guaranteed to introduce corruption because the other nodes are likely to have used affected resource groups that appeared in the journal since the time of the withdraw. Replaying the journal later will overwrite any changes made, and not through any fault of dlm, which was instructed during the withdraw to release those resources. This patch makes file system withdraws seen by the entire cluster. Withdrawing nodes dequeue their journal glock to allow recovery. The remaining nodes check all the journals to see if they are clean or in need of replay. They try to replay dirty journals, but only the journals of withdrawn nodes will be "not busy" and therefore available for replay. Until the journal replay is complete, no i/o related glocks may be given out, to ensure that the replay does not cause the aforementioned corruption: We cannot allow any journal replay to overwrite blocks associated with a glock once it is held. The "live" glock which is now used to signal when a withdraw occurs. When a withdraw occurs, the node signals its withdraw by dequeueing the "live" glock and trying to enqueue it in EX mode, thus forcing the other nodes to all see a demote request, by way of a "1CB" (one callback) try lock. The "live" glock is not granted in EX; the callback is only just used to indicate a withdraw has occurred. Note that all nodes in the cluster must wait for the recovering node to finish replaying the withdrawing node's journal before continuing. To this end, it checks that the journals are clean multiple times in a retry loop. Also note that the withdraw function may be called from a wide variety of situations, and therefore, we need to take extra precautions to make sure pointers are valid before using them in many circumstances. We also need to take care when glocks decide to withdraw, since the withdraw code now uses glocks. Also, before this patch, if a process encountered an error and decided to withdraw, if another process was already withdrawing, the second withdraw would be silently ignored, which set it free to unlock its glocks. That's correct behavior if the original withdrawer encounters further errors down the road. But if secondary waiters don't wait for the journal replay, unlocking glocks will allow other nodes to use them, despite the fact that the journal containing those blocks is being replayed. The replay needs to finish before our glocks are released to other nodes. IOW, secondary withdraws need to wait for the first withdraw to finish. For example, if an rgrp glock is unlocked by a process that didn't wait for the first withdraw, a journal replay could introduce file system corruption by replaying a rgrp block that has already been granted to a different cluster node. Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2020-01-28 19:23:45 +00:00
if (!sdp->sd_args.ar_spectator &&
gfs2_holder_initialized(&sdp->sd_journal_gh))
gfs2_glock_dq_uninit(&sdp->sd_journal_gh);
fail_jindex:
gfs2_jindex_free(sdp);
if (gfs2_holder_initialized(&ji_gh))
gfs2_glock_dq_uninit(&ji_gh);
fail:
iput(sdp->sd_jindex);
return error;
}
static struct lock_class_key gfs2_quota_imutex_key;
static int init_inodes(struct gfs2_sbd *sdp, int undo)
{
int error = 0;
struct inode *master = d_inode(sdp->sd_master_dir);
if (undo)
goto fail_qinode;
error = init_journal(sdp, undo);
complete_all(&sdp->sd_journal_ready);
if (error)
goto fail;
/* Read in the resource index inode */
sdp->sd_rindex = gfs2_lookup_meta(master, "rindex");
if (IS_ERR(sdp->sd_rindex)) {
error = PTR_ERR(sdp->sd_rindex);
fs_err(sdp, "can't get resource index inode: %d\n", error);
goto fail_journal;
}
sdp->sd_rindex_uptodate = 0;
/* Read in the quota inode */
sdp->sd_quota_inode = gfs2_lookup_meta(master, "quota");
if (IS_ERR(sdp->sd_quota_inode)) {
error = PTR_ERR(sdp->sd_quota_inode);
fs_err(sdp, "can't get quota file inode: %d\n", error);
goto fail_rindex;
}
/*
* i_rwsem on quota files is special. Since this inode is hidden system
* file, we are safe to define locking ourselves.
*/
lockdep_set_class(&sdp->sd_quota_inode->i_rwsem,
&gfs2_quota_imutex_key);
error = gfs2_rindex_update(sdp);
if (error)
goto fail_qinode;
return 0;
fail_qinode:
iput(sdp->sd_quota_inode);
fail_rindex:
gfs2_clear_rgrpd(sdp);
iput(sdp->sd_rindex);
fail_journal:
init_journal(sdp, UNDO);
fail:
return error;
}
static int init_per_node(struct gfs2_sbd *sdp, int undo)
{
struct inode *pn = NULL;
char buf[30];
int error = 0;
struct gfs2_inode *ip;
struct inode *master = d_inode(sdp->sd_master_dir);
if (sdp->sd_args.ar_spectator)
return 0;
if (undo)
goto fail_qc_gh;
pn = gfs2_lookup_meta(master, "per_node");
if (IS_ERR(pn)) {
error = PTR_ERR(pn);
fs_err(sdp, "can't find per_node directory: %d\n", error);
return error;
}
sprintf(buf, "quota_change%u", sdp->sd_jdesc->jd_jid);
sdp->sd_qc_inode = gfs2_lookup_meta(pn, buf);
if (IS_ERR(sdp->sd_qc_inode)) {
error = PTR_ERR(sdp->sd_qc_inode);
fs_err(sdp, "can't find local \"qc\" file: %d\n", error);
goto fail_ut_i;
}
iput(pn);
pn = NULL;
ip = GFS2_I(sdp->sd_qc_inode);
error = gfs2_glock_nq_init(ip->i_gl, LM_ST_EXCLUSIVE, GL_NOPID,
&sdp->sd_qc_gh);
if (error) {
fs_err(sdp, "can't lock local \"qc\" file: %d\n", error);
goto fail_qc_i;
}
return 0;
fail_qc_gh:
gfs2_glock_dq_uninit(&sdp->sd_qc_gh);
fail_qc_i:
iput(sdp->sd_qc_inode);
fail_ut_i:
iput(pn);
return error;
}
static const match_table_t nolock_tokens = {
{ Opt_jid, "jid=%d", },
{ Opt_err, NULL },
};
static const struct lm_lockops nolock_ops = {
.lm_proto_name = "lock_nolock",
.lm_put_lock = gfs2_glock_free,
.lm_tokens = &nolock_tokens,
};
/**
* gfs2_lm_mount - mount a locking protocol
* @sdp: the filesystem
* @silent: if 1, don't complain if the FS isn't a GFS2 fs
*
* Returns: errno
*/
static int gfs2_lm_mount(struct gfs2_sbd *sdp, int silent)
{
const struct lm_lockops *lm;
struct lm_lockstruct *ls = &sdp->sd_lockstruct;
struct gfs2_args *args = &sdp->sd_args;
const char *proto = sdp->sd_proto_name;
const char *table = sdp->sd_table_name;
char *o, *options;
int ret;
if (!strcmp("lock_nolock", proto)) {
lm = &nolock_ops;
sdp->sd_args.ar_localflocks = 1;
#ifdef CONFIG_GFS2_FS_LOCKING_DLM
} else if (!strcmp("lock_dlm", proto)) {
lm = &gfs2_dlm_ops;
#endif
} else {
pr_info("can't find protocol %s\n", proto);
return -ENOENT;
}
fs_info(sdp, "Trying to join cluster \"%s\", \"%s\"\n", proto, table);
ls->ls_ops = lm;
ls->ls_first = 1;
for (options = args->ar_hostdata; (o = strsep(&options, ":")); ) {
substring_t tmp[MAX_OPT_ARGS];
int token, option;
if (!o || !*o)
continue;
token = match_token(o, *lm->lm_tokens, tmp);
switch (token) {
case Opt_jid:
ret = match_int(&tmp[0], &option);
if (ret || option < 0)
goto hostdata_error;
if (test_and_clear_bit(SDF_NOJOURNALID, &sdp->sd_flags))
ls->ls_jid = option;
break;
case Opt_id:
dlm: fixes for nodir mode The "nodir" mode (statically assign master nodes instead of using the resource directory) has always been highly experimental, and never seriously used. This commit fixes a number of problems, making nodir much more usable. - Major change to recovery: recover all locks and restart all in-progress operations after recovery. In some cases it's not possible to know which in-progess locks to recover, so recover all. (Most require recovery in nodir mode anyway since rehashing changes most master nodes.) - Change the way nodir mode is enabled, from a command line mount arg passed through gfs2, into a sysfs file managed by dlm_controld, consistent with the other config settings. - Allow recovering MSTCPY locks on an rsb that has not yet been turned into a master copy. - Ignore RCOM_LOCK and RCOM_LOCK_REPLY recovery messages from a previous, aborted recovery cycle. Base this on the local recovery status not being in the state where any nodes should be sending LOCK messages for the current recovery cycle. - Hold rsb lock around dlm_purge_mstcpy_locks() because it may run concurrently with dlm_recover_master_copy(). - Maintain highbast on process-copy lkb's (in addition to the master as is usual), because the lkb can switch back and forth between being a master and being a process copy as the master node changes in recovery. - When recovering MSTCPY locks, flag rsb's that have non-empty convert or waiting queues for granting at the end of recovery. (Rename flag from LOCKS_PURGED to RECOVER_GRANT and similar for the recovery function, because it's not only resources with purged locks that need grant a grant attempt.) - Replace a couple of unnecessary assertion panics with error messages. Signed-off-by: David Teigland <teigland@redhat.com>
2012-04-26 20:54:29 +00:00
case Opt_nodir:
/* Obsolete, but left for backward compat purposes */
break;
case Opt_first:
ret = match_int(&tmp[0], &option);
if (ret || (option != 0 && option != 1))
goto hostdata_error;
ls->ls_first = option;
break;
case Opt_err:
default:
hostdata_error:
fs_info(sdp, "unknown hostdata (%s)\n", o);
return -EINVAL;
}
}
if (lm->lm_mount == NULL) {
fs_info(sdp, "Now mounting FS (format %u)...\n", sdp->sd_sb.sb_fs_format);
complete_all(&sdp->sd_locking_init);
return 0;
}
ret = lm->lm_mount(sdp, table);
if (ret == 0)
fs_info(sdp, "Joined cluster. Now mounting FS (format %u)...\n",
sdp->sd_sb.sb_fs_format);
complete_all(&sdp->sd_locking_init);
return ret;
}
void gfs2_lm_unmount(struct gfs2_sbd *sdp)
{
const struct lm_lockops *lm = sdp->sd_lockstruct.ls_ops;
if (!gfs2_withdrawing_or_withdrawn(sdp) && lm->lm_unmount)
lm->lm_unmount(sdp);
}
static int wait_on_journal(struct gfs2_sbd *sdp)
{
if (sdp->sd_lockstruct.ls_ops->lm_mount == NULL)
return 0;
sched: Remove proliferation of wait_on_bit() action functions The current "wait_on_bit" interface requires an 'action' function to be provided which does the actual waiting. There are over 20 such functions, many of them identical. Most cases can be satisfied by one of just two functions, one which uses io_schedule() and one which just uses schedule(). So: Rename wait_on_bit and wait_on_bit_lock to wait_on_bit_action and wait_on_bit_lock_action to make it explicit that they need an action function. Introduce new wait_on_bit{,_lock} and wait_on_bit{,_lock}_io which are *not* given an action function but implicitly use a standard one. The decision to error-out if a signal is pending is now made based on the 'mode' argument rather than being encoded in the action function. All instances of the old wait_on_bit and wait_on_bit_lock which can use the new version have been changed accordingly and their action functions have been discarded. wait_on_bit{_lock} does not return any specific error code in the event of a signal so the caller must check for non-zero and interpolate their own error code as appropriate. The wait_on_bit() call in __fscache_wait_on_invalidate() was ambiguous as it specified TASK_UNINTERRUPTIBLE but used fscache_wait_bit_interruptible as an action function. David Howells confirms this should be uniformly "uninterruptible" The main remaining user of wait_on_bit{,_lock}_action is NFS which needs to use a freezer-aware schedule() call. A comment in fs/gfs2/glock.c notes that having multiple 'action' functions is useful as they display differently in the 'wchan' field of 'ps'. (and /proc/$PID/wchan). As the new bit_wait{,_io} functions are tagged "__sched", they will not show up at all, but something higher in the stack. So the distinction will still be visible, only with different function names (gds2_glock_wait versus gfs2_glock_dq_wait in the gfs2/glock.c case). Since first version of this patch (against 3.15) two new action functions appeared, on in NFS and one in CIFS. CIFS also now uses an action function that makes the same freezer aware schedule call as NFS. Signed-off-by: NeilBrown <neilb@suse.de> Acked-by: David Howells <dhowells@redhat.com> (fscache, keys) Acked-by: Steven Whitehouse <swhiteho@redhat.com> (gfs2) Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Steve French <sfrench@samba.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20140707051603.28027.72349.stgit@notabene.brown Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-07 05:16:04 +00:00
return wait_on_bit(&sdp->sd_flags, SDF_NOJOURNALID, TASK_INTERRUPTIBLE)
? -EINTR : 0;
}
void gfs2_online_uevent(struct gfs2_sbd *sdp)
{
struct super_block *sb = sdp->sd_vfs;
char ro[20];
char spectator[20];
char *envp[] = { ro, spectator, NULL };
sprintf(ro, "RDONLY=%d", sb_rdonly(sb));
sprintf(spectator, "SPECTATOR=%d", sdp->sd_args.ar_spectator ? 1 : 0);
kobject_uevent_env(&sdp->sd_kobj, KOBJ_ONLINE, envp);
}
static int init_threads(struct gfs2_sbd *sdp)
{
struct task_struct *p;
int error = 0;
p = kthread_create(gfs2_logd, sdp, "gfs2_logd/%s", sdp->sd_fsname);
if (IS_ERR(p)) {
error = PTR_ERR(p);
fs_err(sdp, "can't create logd thread: %d\n", error);
return error;
}
get_task_struct(p);
sdp->sd_logd_process = p;
p = kthread_create(gfs2_quotad, sdp, "gfs2_quotad/%s", sdp->sd_fsname);
if (IS_ERR(p)) {
error = PTR_ERR(p);
fs_err(sdp, "can't create quotad thread: %d\n", error);
goto fail;
}
get_task_struct(p);
sdp->sd_quotad_process = p;
wake_up_process(sdp->sd_logd_process);
wake_up_process(sdp->sd_quotad_process);
return 0;
fail:
kthread_stop_put(sdp->sd_logd_process);
sdp->sd_logd_process = NULL;
return error;
}
void gfs2_destroy_threads(struct gfs2_sbd *sdp)
{
if (sdp->sd_logd_process) {
kthread_stop_put(sdp->sd_logd_process);
sdp->sd_logd_process = NULL;
}
if (sdp->sd_quotad_process) {
kthread_stop_put(sdp->sd_quotad_process);
sdp->sd_quotad_process = NULL;
}
}
/**
* gfs2_fill_super - Read in superblock
* @sb: The VFS superblock
* @fc: Mount options and flags
*
* Returns: -errno
*/
static int gfs2_fill_super(struct super_block *sb, struct fs_context *fc)
{
struct gfs2_args *args = fc->fs_private;
int silent = fc->sb_flags & SB_SILENT;
struct gfs2_sbd *sdp;
struct gfs2_holder mount_gh;
int error;
sdp = init_sbd(sb);
if (!sdp) {
pr_warn("can't alloc struct gfs2_sbd\n");
return -ENOMEM;
}
sdp->sd_args = *args;
if (sdp->sd_args.ar_spectator) {
Rename superblock flags (MS_xyz -> SB_xyz) This is a pure automated search-and-replace of the internal kernel superblock flags. The s_flags are now called SB_*, with the names and the values for the moment mirroring the MS_* flags that they're equivalent to. Note how the MS_xyz flags are the ones passed to the mount system call, while the SB_xyz flags are what we then use in sb->s_flags. The script to do this was: # places to look in; re security/*: it generally should *not* be # touched (that stuff parses mount(2) arguments directly), but # there are two places where we really deal with superblock flags. FILES="drivers/mtd drivers/staging/lustre fs ipc mm \ include/linux/fs.h include/uapi/linux/bfs_fs.h \ security/apparmor/apparmorfs.c security/apparmor/include/lib.h" # the list of MS_... constants SYMS="RDONLY NOSUID NODEV NOEXEC SYNCHRONOUS REMOUNT MANDLOCK \ DIRSYNC NOATIME NODIRATIME BIND MOVE REC VERBOSE SILENT \ POSIXACL UNBINDABLE PRIVATE SLAVE SHARED RELATIME KERNMOUNT \ I_VERSION STRICTATIME LAZYTIME SUBMOUNT NOREMOTELOCK NOSEC BORN \ ACTIVE NOUSER" SED_PROG= for i in $SYMS; do SED_PROG="$SED_PROG -e s/MS_$i/SB_$i/g"; done # we want files that contain at least one of MS_..., # with fs/namespace.c and fs/pnode.c excluded. L=$(for i in $SYMS; do git grep -w -l MS_$i $FILES; done| sort|uniq|grep -v '^fs/namespace.c'|grep -v '^fs/pnode.c') for f in $L; do sed -i $f $SED_PROG; done Requested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-27 21:05:09 +00:00
sb->s_flags |= SB_RDONLY;
set_bit(SDF_RORECOVERY, &sdp->sd_flags);
}
if (sdp->sd_args.ar_posix_acl)
Rename superblock flags (MS_xyz -> SB_xyz) This is a pure automated search-and-replace of the internal kernel superblock flags. The s_flags are now called SB_*, with the names and the values for the moment mirroring the MS_* flags that they're equivalent to. Note how the MS_xyz flags are the ones passed to the mount system call, while the SB_xyz flags are what we then use in sb->s_flags. The script to do this was: # places to look in; re security/*: it generally should *not* be # touched (that stuff parses mount(2) arguments directly), but # there are two places where we really deal with superblock flags. FILES="drivers/mtd drivers/staging/lustre fs ipc mm \ include/linux/fs.h include/uapi/linux/bfs_fs.h \ security/apparmor/apparmorfs.c security/apparmor/include/lib.h" # the list of MS_... constants SYMS="RDONLY NOSUID NODEV NOEXEC SYNCHRONOUS REMOUNT MANDLOCK \ DIRSYNC NOATIME NODIRATIME BIND MOVE REC VERBOSE SILENT \ POSIXACL UNBINDABLE PRIVATE SLAVE SHARED RELATIME KERNMOUNT \ I_VERSION STRICTATIME LAZYTIME SUBMOUNT NOREMOTELOCK NOSEC BORN \ ACTIVE NOUSER" SED_PROG= for i in $SYMS; do SED_PROG="$SED_PROG -e s/MS_$i/SB_$i/g"; done # we want files that contain at least one of MS_..., # with fs/namespace.c and fs/pnode.c excluded. L=$(for i in $SYMS; do git grep -w -l MS_$i $FILES; done| sort|uniq|grep -v '^fs/namespace.c'|grep -v '^fs/pnode.c') for f in $L; do sed -i $f $SED_PROG; done Requested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-27 21:05:09 +00:00
sb->s_flags |= SB_POSIXACL;
if (sdp->sd_args.ar_nobarrier)
set_bit(SDF_NOBARRIERS, &sdp->sd_flags);
Rename superblock flags (MS_xyz -> SB_xyz) This is a pure automated search-and-replace of the internal kernel superblock flags. The s_flags are now called SB_*, with the names and the values for the moment mirroring the MS_* flags that they're equivalent to. Note how the MS_xyz flags are the ones passed to the mount system call, while the SB_xyz flags are what we then use in sb->s_flags. The script to do this was: # places to look in; re security/*: it generally should *not* be # touched (that stuff parses mount(2) arguments directly), but # there are two places where we really deal with superblock flags. FILES="drivers/mtd drivers/staging/lustre fs ipc mm \ include/linux/fs.h include/uapi/linux/bfs_fs.h \ security/apparmor/apparmorfs.c security/apparmor/include/lib.h" # the list of MS_... constants SYMS="RDONLY NOSUID NODEV NOEXEC SYNCHRONOUS REMOUNT MANDLOCK \ DIRSYNC NOATIME NODIRATIME BIND MOVE REC VERBOSE SILENT \ POSIXACL UNBINDABLE PRIVATE SLAVE SHARED RELATIME KERNMOUNT \ I_VERSION STRICTATIME LAZYTIME SUBMOUNT NOREMOTELOCK NOSEC BORN \ ACTIVE NOUSER" SED_PROG= for i in $SYMS; do SED_PROG="$SED_PROG -e s/MS_$i/SB_$i/g"; done # we want files that contain at least one of MS_..., # with fs/namespace.c and fs/pnode.c excluded. L=$(for i in $SYMS; do git grep -w -l MS_$i $FILES; done| sort|uniq|grep -v '^fs/namespace.c'|grep -v '^fs/pnode.c') for f in $L; do sed -i $f $SED_PROG; done Requested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-27 21:05:09 +00:00
sb->s_flags |= SB_NOSEC;
sb->s_magic = GFS2_MAGIC;
sb->s_op = &gfs2_super_ops;
sb->s_d_op = &gfs2_dops;
sb->s_export_op = &gfs2_export_ops;
sb->s_qcop = &gfs2_quotactl_ops;
sb->s_quota_types = QTYPE_MASK_USR | QTYPE_MASK_GRP;
sb_dqopt(sb)->flags |= DQUOT_QUOTA_SYS_FILE;
sb->s_time_gran = 1;
sb->s_maxbytes = MAX_LFS_FILESIZE;
/* Set up the buffer cache and fill in some fake block size values
to allow us to read-in the on-disk superblock. */
sdp->sd_sb.sb_bsize = sb_min_blocksize(sb, 512);
sdp->sd_sb.sb_bsize_shift = sb->s_blocksize_bits;
sdp->sd_fsb2bb_shift = sdp->sd_sb.sb_bsize_shift - 9;
sdp->sd_fsb2bb = BIT(sdp->sd_fsb2bb_shift);
sdp->sd_tune.gt_logd_secs = sdp->sd_args.ar_commit;
sdp->sd_tune.gt_quota_quantum = sdp->sd_args.ar_quota_quantum;
if (sdp->sd_args.ar_statfs_quantum) {
sdp->sd_tune.gt_statfs_slow = 0;
sdp->sd_tune.gt_statfs_quantum = sdp->sd_args.ar_statfs_quantum;
} else {
sdp->sd_tune.gt_statfs_slow = 1;
sdp->sd_tune.gt_statfs_quantum = 30;
}
error = init_names(sdp, silent);
gfs2: use-after-free in sysfs deregistration syzkaller found the following splat with CONFIG_DEBUG_KOBJECT_RELEASE=y: Read of size 1 at addr ffff000028e896b8 by task kworker/1:2/228 CPU: 1 PID: 228 Comm: kworker/1:2 Tainted: G S 5.9.0-rc8+ #101 Hardware name: linux,dummy-virt (DT) Workqueue: events kobject_delayed_cleanup Call trace: dump_backtrace+0x0/0x4d8 show_stack+0x34/0x48 dump_stack+0x174/0x1f8 print_address_description.constprop.0+0x5c/0x550 kasan_report+0x13c/0x1c0 __asan_report_load1_noabort+0x34/0x60 memcmp+0xd0/0xd8 gfs2_uevent+0xc4/0x188 kobject_uevent_env+0x54c/0x1240 kobject_uevent+0x2c/0x40 __kobject_del+0x190/0x1d8 kobject_delayed_cleanup+0x2bc/0x3b8 process_one_work+0x96c/0x18c0 worker_thread+0x3f0/0xc30 kthread+0x390/0x498 ret_from_fork+0x10/0x18 Allocated by task 1110: kasan_save_stack+0x28/0x58 __kasan_kmalloc.isra.0+0xc8/0xe8 kasan_kmalloc+0x10/0x20 kmem_cache_alloc_trace+0x1d8/0x2f0 alloc_super+0x64/0x8c0 sget_fc+0x110/0x620 get_tree_bdev+0x190/0x648 gfs2_get_tree+0x50/0x228 vfs_get_tree+0x84/0x2e8 path_mount+0x1134/0x1da8 do_mount+0x124/0x138 __arm64_sys_mount+0x164/0x238 el0_svc_common.constprop.0+0x15c/0x598 do_el0_svc+0x60/0x150 el0_svc+0x34/0xb0 el0_sync_handler+0xc8/0x5b4 el0_sync+0x15c/0x180 Freed by task 228: kasan_save_stack+0x28/0x58 kasan_set_track+0x28/0x40 kasan_set_free_info+0x24/0x48 __kasan_slab_free+0x118/0x190 kasan_slab_free+0x14/0x20 slab_free_freelist_hook+0x6c/0x210 kfree+0x13c/0x460 Use the same pattern as f2fs + ext4 where the kobject destruction must complete before allowing the FS itself to be freed. This means that we need an explicit free_sbd in the callers. Cc: Bob Peterson <rpeterso@redhat.com> Cc: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Jamie Iles <jamie@nuviainc.com> [Also go to fail_free when init_names fails.] Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-10-12 13:13:09 +00:00
if (error)
goto fail_free;
snprintf(sdp->sd_fsname, sizeof(sdp->sd_fsname), "%s", sdp->sd_table_name);
sdp->sd_delete_wq = alloc_workqueue("gfs2-delete/%s",
WQ_MEM_RECLAIM | WQ_FREEZABLE, 0, sdp->sd_fsname);
error = -ENOMEM;
if (!sdp->sd_delete_wq)
goto fail_free;
error = gfs2_sys_fs_add(sdp);
if (error)
goto fail_delete_wq;
gfs2_create_debugfs_file(sdp);
error = gfs2_lm_mount(sdp, silent);
if (error)
goto fail_debug;
error = init_locking(sdp, &mount_gh, DO);
if (error)
goto fail_lm;
error = init_sb(sdp, silent);
if (error)
goto fail_locking;
/* Turn rgrplvb on by default if fs format is recent enough */
if (!sdp->sd_args.ar_got_rgrplvb && sdp->sd_sb.sb_fs_format > 1801)
sdp->sd_args.ar_rgrplvb = 1;
error = wait_on_journal(sdp);
if (error)
goto fail_sb;
/*
* If user space has failed to join the cluster or some similar
* failure has occurred, then the journal id will contain a
* negative (error) number. This will then be returned to the
* caller (of the mount syscall). We do this even for spectator
* mounts (which just write a jid of 0 to indicate "ok" even though
* the jid is unused in the spectator case)
*/
if (sdp->sd_lockstruct.ls_jid < 0) {
error = sdp->sd_lockstruct.ls_jid;
sdp->sd_lockstruct.ls_jid = 0;
goto fail_sb;
}
if (sdp->sd_args.ar_spectator)
snprintf(sdp->sd_fsname, sizeof(sdp->sd_fsname), "%s.s",
sdp->sd_table_name);
else
snprintf(sdp->sd_fsname, sizeof(sdp->sd_fsname), "%s.%u",
sdp->sd_table_name, sdp->sd_lockstruct.ls_jid);
error = init_inodes(sdp, DO);
if (error)
goto fail_sb;
error = init_per_node(sdp, DO);
if (error)
goto fail_inodes;
error = gfs2_statfs_init(sdp);
if (error) {
fs_err(sdp, "can't initialize statfs subsystem: %d\n", error);
goto fail_per_node;
}
if (!sb_rdonly(sb)) {
error = init_threads(sdp);
if (error)
goto fail_per_node;
}
error = gfs2_freeze_lock_shared(sdp);
if (error)
goto fail_per_node;
if (!sb_rdonly(sb))
error = gfs2_make_fs_rw(sdp);
if (error) {
gfs2: Rework freeze / thaw logic So far, at mount time, gfs2 would take the freeze glock in shared mode and then immediately drop it again, turning it into a cached glock that can be reclaimed at any time. To freeze the filesystem cluster-wide, the node initiating the freeze would take the freeze glock in exclusive mode, which would cause the freeze glock's freeze_go_sync() callback to run on each node. There, gfs2 would freeze the filesystem and schedule gfs2_freeze_func() to run. gfs2_freeze_func() would re-acquire the freeze glock in shared mode, thaw the filesystem, and drop the freeze glock again. The initiating node would keep the freeze glock held in exclusive mode. To thaw the filesystem, the initiating node would drop the freeze glock again, which would allow gfs2_freeze_func() to resume on all nodes, leaving the filesystem in the thawed state. It turns out that in freeze_go_sync(), we cannot reliably and safely freeze the filesystem. This is primarily because the final unmount of a filesystem takes a write lock on the s_umount rw semaphore before calling into gfs2_put_super(), and freeze_go_sync() needs to call freeze_super() which also takes a write lock on the same semaphore, causing a deadlock. We could work around this by trying to take an active reference on the super block first, which would prevent unmount from running at the same time. But that can fail, and freeze_go_sync() isn't actually allowed to fail. To get around this, this patch changes the freeze glock locking scheme as follows: At mount time, each node takes the freeze glock in shared mode. To freeze a filesystem, the initiating node first freezes the filesystem locally and then drops and re-acquires the freeze glock in exclusive mode. All other nodes notice that there is contention on the freeze glock in their go_callback callbacks, and they schedule gfs2_freeze_func() to run. There, they freeze the filesystem locally and drop and re-acquire the freeze glock before re-thawing the filesystem. This is happening outside of the glock state engine, so there, we are allowed to fail. From a cluster point of view, taking and immediately dropping a glock is indistinguishable from taking the glock and only dropping it upon contention, so this new scheme is compatible with the old one. Thanks to Li Dong <lidong@vivo.com> for reporting a locking bug in gfs2_freeze_func() in a previous version of this commit. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2022-11-14 22:34:50 +00:00
gfs2_freeze_unlock(&sdp->sd_freeze_gh);
gfs2_destroy_threads(sdp);
fs_err(sdp, "can't make FS RW: %d\n", error);
goto fail_per_node;
}
gfs2_glock_dq_uninit(&mount_gh);
gfs2_online_uevent(sdp);
return 0;
fail_per_node:
init_per_node(sdp, UNDO);
fail_inodes:
init_inodes(sdp, UNDO);
fail_sb:
if (sdp->sd_root_dir)
dput(sdp->sd_root_dir);
if (sdp->sd_master_dir)
dput(sdp->sd_master_dir);
if (sb->s_root)
dput(sb->s_root);
sb->s_root = NULL;
fail_locking:
init_locking(sdp, &mount_gh, UNDO);
fail_lm:
complete_all(&sdp->sd_journal_ready);
gfs2_gl_hash_clear(sdp);
gfs2_lm_unmount(sdp);
fail_debug:
gfs2_delete_debugfs_file(sdp);
gfs2_sys_fs_del(sdp);
fail_delete_wq:
destroy_workqueue(sdp->sd_delete_wq);
gfs2: use-after-free in sysfs deregistration syzkaller found the following splat with CONFIG_DEBUG_KOBJECT_RELEASE=y: Read of size 1 at addr ffff000028e896b8 by task kworker/1:2/228 CPU: 1 PID: 228 Comm: kworker/1:2 Tainted: G S 5.9.0-rc8+ #101 Hardware name: linux,dummy-virt (DT) Workqueue: events kobject_delayed_cleanup Call trace: dump_backtrace+0x0/0x4d8 show_stack+0x34/0x48 dump_stack+0x174/0x1f8 print_address_description.constprop.0+0x5c/0x550 kasan_report+0x13c/0x1c0 __asan_report_load1_noabort+0x34/0x60 memcmp+0xd0/0xd8 gfs2_uevent+0xc4/0x188 kobject_uevent_env+0x54c/0x1240 kobject_uevent+0x2c/0x40 __kobject_del+0x190/0x1d8 kobject_delayed_cleanup+0x2bc/0x3b8 process_one_work+0x96c/0x18c0 worker_thread+0x3f0/0xc30 kthread+0x390/0x498 ret_from_fork+0x10/0x18 Allocated by task 1110: kasan_save_stack+0x28/0x58 __kasan_kmalloc.isra.0+0xc8/0xe8 kasan_kmalloc+0x10/0x20 kmem_cache_alloc_trace+0x1d8/0x2f0 alloc_super+0x64/0x8c0 sget_fc+0x110/0x620 get_tree_bdev+0x190/0x648 gfs2_get_tree+0x50/0x228 vfs_get_tree+0x84/0x2e8 path_mount+0x1134/0x1da8 do_mount+0x124/0x138 __arm64_sys_mount+0x164/0x238 el0_svc_common.constprop.0+0x15c/0x598 do_el0_svc+0x60/0x150 el0_svc+0x34/0xb0 el0_sync_handler+0xc8/0x5b4 el0_sync+0x15c/0x180 Freed by task 228: kasan_save_stack+0x28/0x58 kasan_set_track+0x28/0x40 kasan_set_free_info+0x24/0x48 __kasan_slab_free+0x118/0x190 kasan_slab_free+0x14/0x20 slab_free_freelist_hook+0x6c/0x210 kfree+0x13c/0x460 Use the same pattern as f2fs + ext4 where the kobject destruction must complete before allowing the FS itself to be freed. This means that we need an explicit free_sbd in the callers. Cc: Bob Peterson <rpeterso@redhat.com> Cc: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Jamie Iles <jamie@nuviainc.com> [Also go to fail_free when init_names fails.] Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-10-12 13:13:09 +00:00
fail_free:
free_sbd(sdp);
sb->s_fs_info = NULL;
return error;
}
/**
* gfs2_get_tree - Get the GFS2 superblock and root directory
* @fc: The filesystem context
*
* Returns: 0 or -errno on error
*/
static int gfs2_get_tree(struct fs_context *fc)
{
struct gfs2_args *args = fc->fs_private;
struct gfs2_sbd *sdp;
int error;
error = get_tree_bdev(fc, gfs2_fill_super);
if (error)
return error;
sdp = fc->root->d_sb->s_fs_info;
dput(fc->root);
if (args->ar_meta)
fc->root = dget(sdp->sd_master_dir);
else
fc->root = dget(sdp->sd_root_dir);
return 0;
}
static void gfs2_fc_free(struct fs_context *fc)
{
struct gfs2_args *args = fc->fs_private;
kfree(args);
}
enum gfs2_param {
Opt_lockproto,
Opt_locktable,
Opt_hostdata,
Opt_spectator,
Opt_ignore_local_fs,
Opt_localflocks,
Opt_localcaching,
Opt_debug,
Opt_upgrade,
Opt_acl,
Opt_quota,
Opt_quota_flag,
Opt_suiddir,
Opt_data,
Opt_meta,
Opt_discard,
Opt_commit,
Opt_errors,
Opt_statfs_quantum,
Opt_statfs_percent,
Opt_quota_quantum,
Opt_barrier,
Opt_rgrplvb,
Opt_loccookie,
};
static const struct constant_table gfs2_param_quota[] = {
{"off", GFS2_QUOTA_OFF},
{"account", GFS2_QUOTA_ACCOUNT},
{"on", GFS2_QUOTA_ON},
{"quiet", GFS2_QUOTA_QUIET},
{}
};
enum opt_data {
Opt_data_writeback = GFS2_DATA_WRITEBACK,
Opt_data_ordered = GFS2_DATA_ORDERED,
};
static const struct constant_table gfs2_param_data[] = {
{"writeback", Opt_data_writeback },
{"ordered", Opt_data_ordered },
{}
};
enum opt_errors {
Opt_errors_withdraw = GFS2_ERRORS_WITHDRAW,
Opt_errors_panic = GFS2_ERRORS_PANIC,
};
static const struct constant_table gfs2_param_errors[] = {
{"withdraw", Opt_errors_withdraw },
{"panic", Opt_errors_panic },
{}
};
static const struct fs_parameter_spec gfs2_fs_parameters[] = {
fsparam_string ("lockproto", Opt_lockproto),
fsparam_string ("locktable", Opt_locktable),
fsparam_string ("hostdata", Opt_hostdata),
fsparam_flag ("spectator", Opt_spectator),
fsparam_flag ("norecovery", Opt_spectator),
fsparam_flag ("ignore_local_fs", Opt_ignore_local_fs),
fsparam_flag ("localflocks", Opt_localflocks),
fsparam_flag ("localcaching", Opt_localcaching),
fsparam_flag_no("debug", Opt_debug),
fsparam_flag ("upgrade", Opt_upgrade),
fsparam_flag_no("acl", Opt_acl),
fsparam_flag_no("suiddir", Opt_suiddir),
fsparam_enum ("data", Opt_data, gfs2_param_data),
fsparam_flag ("meta", Opt_meta),
fsparam_flag_no("discard", Opt_discard),
fsparam_s32 ("commit", Opt_commit),
fsparam_enum ("errors", Opt_errors, gfs2_param_errors),
fsparam_s32 ("statfs_quantum", Opt_statfs_quantum),
fsparam_s32 ("statfs_percent", Opt_statfs_percent),
fsparam_s32 ("quota_quantum", Opt_quota_quantum),
fsparam_flag_no("barrier", Opt_barrier),
fsparam_flag_no("rgrplvb", Opt_rgrplvb),
fsparam_flag_no("loccookie", Opt_loccookie),
/* quota can be a flag or an enum so it gets special treatment */
fsparam_flag_no("quota", Opt_quota_flag),
fsparam_enum("quota", Opt_quota, gfs2_param_quota),
{}
};
/* Parse a single mount parameter */
static int gfs2_parse_param(struct fs_context *fc, struct fs_parameter *param)
{
struct gfs2_args *args = fc->fs_private;
struct fs_parse_result result;
int o;
o = fs_parse(fc, gfs2_fs_parameters, param, &result);
if (o < 0)
return o;
switch (o) {
case Opt_lockproto:
strscpy(args->ar_lockproto, param->string, GFS2_LOCKNAME_LEN);
break;
case Opt_locktable:
strscpy(args->ar_locktable, param->string, GFS2_LOCKNAME_LEN);
break;
case Opt_hostdata:
strscpy(args->ar_hostdata, param->string, GFS2_LOCKNAME_LEN);
break;
case Opt_spectator:
args->ar_spectator = 1;
break;
case Opt_ignore_local_fs:
/* Retained for backwards compat only */
break;
case Opt_localflocks:
args->ar_localflocks = 1;
break;
case Opt_localcaching:
/* Retained for backwards compat only */
break;
case Opt_debug:
if (result.boolean && args->ar_errors == GFS2_ERRORS_PANIC)
return invalfc(fc, "-o debug and -o errors=panic are mutually exclusive");
args->ar_debug = result.boolean;
break;
case Opt_upgrade:
/* Retained for backwards compat only */
break;
case Opt_acl:
args->ar_posix_acl = result.boolean;
break;
case Opt_quota_flag:
args->ar_quota = result.negated ? GFS2_QUOTA_OFF : GFS2_QUOTA_ON;
break;
case Opt_quota:
args->ar_quota = result.int_32;
break;
case Opt_suiddir:
args->ar_suiddir = result.boolean;
break;
case Opt_data:
/* The uint_32 result maps directly to GFS2_DATA_* */
args->ar_data = result.uint_32;
break;
case Opt_meta:
args->ar_meta = 1;
break;
case Opt_discard:
args->ar_discard = result.boolean;
break;
case Opt_commit:
if (result.int_32 <= 0)
return invalfc(fc, "commit mount option requires a positive numeric argument");
args->ar_commit = result.int_32;
break;
case Opt_statfs_quantum:
if (result.int_32 < 0)
return invalfc(fc, "statfs_quantum mount option requires a non-negative numeric argument");
args->ar_statfs_quantum = result.int_32;
break;
case Opt_quota_quantum:
if (result.int_32 <= 0)
return invalfc(fc, "quota_quantum mount option requires a positive numeric argument");
args->ar_quota_quantum = result.int_32;
break;
case Opt_statfs_percent:
if (result.int_32 < 0 || result.int_32 > 100)
return invalfc(fc, "statfs_percent mount option requires a numeric argument between 0 and 100");
args->ar_statfs_percent = result.int_32;
break;
case Opt_errors:
if (args->ar_debug && result.uint_32 == GFS2_ERRORS_PANIC)
return invalfc(fc, "-o debug and -o errors=panic are mutually exclusive");
args->ar_errors = result.uint_32;
break;
case Opt_barrier:
args->ar_nobarrier = result.boolean;
break;
case Opt_rgrplvb:
args->ar_rgrplvb = result.boolean;
args->ar_got_rgrplvb = 1;
break;
case Opt_loccookie:
args->ar_loccookie = result.boolean;
break;
default:
return invalfc(fc, "invalid mount option: %s", param->key);
}
return 0;
}
static int gfs2_reconfigure(struct fs_context *fc)
{
struct super_block *sb = fc->root->d_sb;
struct gfs2_sbd *sdp = sb->s_fs_info;
struct gfs2_args *oldargs = &sdp->sd_args;
struct gfs2_args *newargs = fc->fs_private;
struct gfs2_tune *gt = &sdp->sd_tune;
int error = 0;
sync_filesystem(sb);
spin_lock(&gt->gt_spin);
oldargs->ar_commit = gt->gt_logd_secs;
oldargs->ar_quota_quantum = gt->gt_quota_quantum;
if (gt->gt_statfs_slow)
oldargs->ar_statfs_quantum = 0;
else
oldargs->ar_statfs_quantum = gt->gt_statfs_quantum;
spin_unlock(&gt->gt_spin);
if (strcmp(newargs->ar_lockproto, oldargs->ar_lockproto)) {
errorfc(fc, "reconfiguration of locking protocol not allowed");
return -EINVAL;
}
if (strcmp(newargs->ar_locktable, oldargs->ar_locktable)) {
errorfc(fc, "reconfiguration of lock table not allowed");
return -EINVAL;
}
if (strcmp(newargs->ar_hostdata, oldargs->ar_hostdata)) {
errorfc(fc, "reconfiguration of host data not allowed");
return -EINVAL;
}
if (newargs->ar_spectator != oldargs->ar_spectator) {
errorfc(fc, "reconfiguration of spectator mode not allowed");
return -EINVAL;
}
if (newargs->ar_localflocks != oldargs->ar_localflocks) {
errorfc(fc, "reconfiguration of localflocks not allowed");
return -EINVAL;
}
if (newargs->ar_meta != oldargs->ar_meta) {
errorfc(fc, "switching between gfs2 and gfs2meta not allowed");
return -EINVAL;
}
if (oldargs->ar_spectator)
fc->sb_flags |= SB_RDONLY;
if ((sb->s_flags ^ fc->sb_flags) & SB_RDONLY) {
if (fc->sb_flags & SB_RDONLY) {
gfs2_make_fs_ro(sdp);
} else {
error = gfs2_make_fs_rw(sdp);
if (error)
errorfc(fc, "unable to remount read-write");
}
}
sdp->sd_args = *newargs;
if (sdp->sd_args.ar_posix_acl)
sb->s_flags |= SB_POSIXACL;
else
sb->s_flags &= ~SB_POSIXACL;
if (sdp->sd_args.ar_nobarrier)
set_bit(SDF_NOBARRIERS, &sdp->sd_flags);
else
clear_bit(SDF_NOBARRIERS, &sdp->sd_flags);
spin_lock(&gt->gt_spin);
gt->gt_logd_secs = newargs->ar_commit;
gt->gt_quota_quantum = newargs->ar_quota_quantum;
if (newargs->ar_statfs_quantum) {
gt->gt_statfs_slow = 0;
gt->gt_statfs_quantum = newargs->ar_statfs_quantum;
}
else {
gt->gt_statfs_slow = 1;
gt->gt_statfs_quantum = 30;
}
spin_unlock(&gt->gt_spin);
gfs2_online_uevent(sdp);
return error;
}
static const struct fs_context_operations gfs2_context_ops = {
.free = gfs2_fc_free,
.parse_param = gfs2_parse_param,
.get_tree = gfs2_get_tree,
.reconfigure = gfs2_reconfigure,
};
/* Set up the filesystem mount context */
static int gfs2_init_fs_context(struct fs_context *fc)
{
struct gfs2_args *args;
args = kmalloc(sizeof(*args), GFP_KERNEL);
if (args == NULL)
return -ENOMEM;
if (fc->purpose == FS_CONTEXT_FOR_RECONFIGURE) {
struct gfs2_sbd *sdp = fc->root->d_sb->s_fs_info;
*args = sdp->sd_args;
} else {
memset(args, 0, sizeof(*args));
args->ar_quota = GFS2_QUOTA_DEFAULT;
args->ar_data = GFS2_DATA_DEFAULT;
args->ar_commit = 30;
args->ar_statfs_quantum = 30;
args->ar_quota_quantum = 60;
args->ar_errors = GFS2_ERRORS_DEFAULT;
}
fc->fs_private = args;
fc->ops = &gfs2_context_ops;
return 0;
}
static int set_meta_super(struct super_block *s, struct fs_context *fc)
{
return -EINVAL;
}
static int test_meta_super(struct super_block *s, struct fs_context *fc)
{
return (fc->sget_key == s->s_bdev);
}
static int gfs2_meta_get_tree(struct fs_context *fc)
{
struct super_block *s;
struct gfs2_sbd *sdp;
struct path path;
int error;
if (!fc->source || !*fc->source)
return -EINVAL;
error = kern_path(fc->source, LOOKUP_FOLLOW, &path);
if (error) {
pr_warn("path_lookup on %s returned error %d\n",
fc->source, error);
return error;
}
fc->fs_type = &gfs2_fs_type;
fc->sget_key = path.dentry->d_sb->s_bdev;
s = sget_fc(fc, test_meta_super, set_meta_super);
path_put(&path);
if (IS_ERR(s)) {
pr_warn("gfs2 mount does not exist\n");
return PTR_ERR(s);
}
if ((fc->sb_flags ^ s->s_flags) & SB_RDONLY) {
deactivate_locked_super(s);
return -EBUSY;
}
sdp = s->s_fs_info;
fc->root = dget(sdp->sd_master_dir);
return 0;
}
static const struct fs_context_operations gfs2_meta_context_ops = {
.free = gfs2_fc_free,
.get_tree = gfs2_meta_get_tree,
};
static int gfs2_meta_init_fs_context(struct fs_context *fc)
{
int ret = gfs2_init_fs_context(fc);
if (ret)
return ret;
fc->ops = &gfs2_meta_context_ops;
return 0;
}
/**
* gfs2_evict_inodes - evict inodes cooperatively
* @sb: the superblock
*
* When evicting an inode with a zero link count, we are trying to upgrade the
* inode's iopen glock from SH to EX mode in order to determine if we can
* delete the inode. The other nodes are supposed to evict the inode from
* their caches if they can, and to poke the inode's inode glock if they cannot
* do so. Either behavior allows gfs2_upgrade_iopen_glock() to proceed
* quickly, but if the other nodes are not cooperating, the lock upgrading
* attempt will time out. Since inodes are evicted sequentially, this can add
* up quickly.
*
* Function evict_inodes() tries to keep the s_inode_list_lock list locked over
* a long time, which prevents other inodes from being evicted concurrently.
* This precludes the cooperative behavior we are looking for. This special
* version of evict_inodes() avoids that.
*
* Modeled after drop_pagecache_sb().
*/
static void gfs2_evict_inodes(struct super_block *sb)
{
struct inode *inode, *toput_inode = NULL;
struct gfs2_sbd *sdp = sb->s_fs_info;
set_bit(SDF_EVICTING, &sdp->sd_flags);
spin_lock(&sb->s_inode_list_lock);
list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
spin_lock(&inode->i_lock);
if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) &&
!need_resched()) {
spin_unlock(&inode->i_lock);
continue;
}
atomic_inc(&inode->i_count);
spin_unlock(&inode->i_lock);
spin_unlock(&sb->s_inode_list_lock);
iput(toput_inode);
toput_inode = inode;
cond_resched();
spin_lock(&sb->s_inode_list_lock);
}
spin_unlock(&sb->s_inode_list_lock);
iput(toput_inode);
}
static void gfs2_kill_sb(struct super_block *sb)
{
struct gfs2_sbd *sdp = sb->s_fs_info;
if (sdp == NULL) {
kill_block_super(sb);
return;
}
gfs2_log_flush(sdp, NULL, GFS2_LOG_HEAD_FLUSH_SYNC | GFS2_LFC_KILL_SB);
dput(sdp->sd_root_dir);
dput(sdp->sd_master_dir);
sdp->sd_root_dir = NULL;
sdp->sd_master_dir = NULL;
shrink_dcache_sb(sb);
gfs2_evict_inodes(sb);
/*
* Flush and then drain the delete workqueue here (via
* destroy_workqueue()) to ensure that any delete work that
* may be running will also see the SDF_KILL flag.
*/
set_bit(SDF_KILL, &sdp->sd_flags);
gfs2_flush_delete_work(sdp);
destroy_workqueue(sdp->sd_delete_wq);
kill_block_super(sb);
}
struct file_system_type gfs2_fs_type = {
.name = "gfs2",
.fs_flags = FS_REQUIRES_DEV,
.init_fs_context = gfs2_init_fs_context,
.parameters = gfs2_fs_parameters,
.kill_sb = gfs2_kill_sb,
.owner = THIS_MODULE,
};
fs: Limit sys_mount to only request filesystem modules. Modify the request_module to prefix the file system type with "fs-" and add aliases to all of the filesystems that can be built as modules to match. A common practice is to build all of the kernel code and leave code that is not commonly needed as modules, with the result that many users are exposed to any bug anywhere in the kernel. Looking for filesystems with a fs- prefix limits the pool of possible modules that can be loaded by mount to just filesystems trivially making things safer with no real cost. Using aliases means user space can control the policy of which filesystem modules are auto-loaded by editing /etc/modprobe.d/*.conf with blacklist and alias directives. Allowing simple, safe, well understood work-arounds to known problematic software. This also addresses a rare but unfortunate problem where the filesystem name is not the same as it's module name and module auto-loading would not work. While writing this patch I saw a handful of such cases. The most significant being autofs that lives in the module autofs4. This is relevant to user namespaces because we can reach the request module in get_fs_type() without having any special permissions, and people get uncomfortable when a user specified string (in this case the filesystem type) goes all of the way to request_module. After having looked at this issue I don't think there is any particular reason to perform any filtering or permission checks beyond making it clear in the module request that we want a filesystem module. The common pattern in the kernel is to call request_module() without regards to the users permissions. In general all a filesystem module does once loaded is call register_filesystem() and go to sleep. Which means there is not much attack surface exposed by loading a filesytem module unless the filesystem is mounted. In a user namespace filesystems are not mounted unless .fs_flags = FS_USERNS_MOUNT, which most filesystems do not set today. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Acked-by: Kees Cook <keescook@chromium.org> Reported-by: Kees Cook <keescook@google.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2013-03-03 03:39:14 +00:00
MODULE_ALIAS_FS("gfs2");
struct file_system_type gfs2meta_fs_type = {
.name = "gfs2meta",
.fs_flags = FS_REQUIRES_DEV,
.init_fs_context = gfs2_meta_init_fs_context,
.owner = THIS_MODULE,
};
fs: Limit sys_mount to only request filesystem modules. Modify the request_module to prefix the file system type with "fs-" and add aliases to all of the filesystems that can be built as modules to match. A common practice is to build all of the kernel code and leave code that is not commonly needed as modules, with the result that many users are exposed to any bug anywhere in the kernel. Looking for filesystems with a fs- prefix limits the pool of possible modules that can be loaded by mount to just filesystems trivially making things safer with no real cost. Using aliases means user space can control the policy of which filesystem modules are auto-loaded by editing /etc/modprobe.d/*.conf with blacklist and alias directives. Allowing simple, safe, well understood work-arounds to known problematic software. This also addresses a rare but unfortunate problem where the filesystem name is not the same as it's module name and module auto-loading would not work. While writing this patch I saw a handful of such cases. The most significant being autofs that lives in the module autofs4. This is relevant to user namespaces because we can reach the request module in get_fs_type() without having any special permissions, and people get uncomfortable when a user specified string (in this case the filesystem type) goes all of the way to request_module. After having looked at this issue I don't think there is any particular reason to perform any filtering or permission checks beyond making it clear in the module request that we want a filesystem module. The common pattern in the kernel is to call request_module() without regards to the users permissions. In general all a filesystem module does once loaded is call register_filesystem() and go to sleep. Which means there is not much attack surface exposed by loading a filesytem module unless the filesystem is mounted. In a user namespace filesystems are not mounted unless .fs_flags = FS_USERNS_MOUNT, which most filesystems do not set today. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Acked-by: Kees Cook <keescook@chromium.org> Reported-by: Kees Cook <keescook@google.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2013-03-03 03:39:14 +00:00
MODULE_ALIAS_FS("gfs2meta");