Commit graph

267 commits

Author SHA1 Message Date
Bob Peterson
c860f8ffbe gfs2: The freeze glock should never be frozen
Before this patch, some gfs2 code locked the freeze glock with LM_FLAG_NOEXP
(Do not freeze) flag, and some did not. We never want to freeze the freeze
glock, so this patch makes it consistently use LM_FLAG_NOEXP always.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2020-07-03 12:05:35 +02:00
Bob Peterson
b780cc615b gfs2: read-only mounts should grab the sd_freeze_gl glock
Before this patch, only read-write mounts would grab the freeze
glock in read-only mode, as part of gfs2_make_fs_rw. So the freeze
glock was never initialized. That meant requests to freeze, which
request the glock in EX, were granted without any state transition.
That meant you could mount a gfs2 file system, which is currently
frozen on a different cluster node, in read-only mode.

This patch makes read-only mounts lock the freeze glock in SH mode,
which will block for file systems that are frozen on another node.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2020-07-03 12:05:35 +02:00
Bob Peterson
ea22eee4e6 gfs2: Allow lock_nolock mount to specify jid=X
Before this patch, a simple typo accidentally added \n to the jid=
string for lock_nolock mounts. This made it impossible to mount a
gfs2 file system with a journal other than journal0. Thus:

mount -tgfs2 -o hostdata="jid=1" <device> <mount pt>

Resulted in:
mount: wrong fs type, bad option, bad superblock on <device>

In most cases this is not a problem. However, for debugging and
testing purposes we sometimes want to test the integrity of other
journals. This patch removes the unnecessary \n and thus allows
lock_nolock users to specify an alternate journal.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-06-02 19:45:05 +02:00
Linus Torvalds
018d21f5c5 We've got a lot of patches (39) for this merge window. Most of these patches
are related to corruption that occurs when journals are replayed.
 For example:
 
    1. A node fails while writing to the file system.
    2. Other nodes use the metadata that was once used by the failed node.
    3. When the node returns to the cluster, its journal is replayed,
       but the older metadata blocks overwrite the changes from step 2.
 
 - Fixed the recovery sequence to prevent corruption during journal replay.
 - Many bug fixes found during recovery testing.
 - New improved file system withdraw sequence.
 - Fixed how resource group buffers are managed.
 - Fixed how metadata revokes are tracked and written.
 - Improve processing of IO errors hit by daemons like logd and quotad.
 - Improved error checking in metadata writes.
 - Fixed how qadata quota data structures are managed.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEE89F0ZrnZapxy/9qS14th09/3ejsFAl6Db/QACgkQ14th09/3
 ejvVTgf+IdHXfmpv3ftah8lDDpbsnSKZYRC1NW7skQB+NVG9KtJhtzy1nldaMqMv
 s8wQ5aGKrfBfmzg8IZ9Pt3dCItFqC5d8IqcO0M0FtNuyN+27ETUUMnqBf1NwL6wI
 iAm/+ncZ/BiZN2P8MgXV3OgRGvaC9ebmz860+nthwyJT+6y8d8Qab7pUfyix5e0d
 oTgDhEJqF0DOrGsrlS5rxjTU+RMixtepsAW958D4Eks28OlyduRAj6fAMDoLN2/E
 WoDpX6iKeczH0lOZxnIVQOkCztDaa0jDlK2JK7sJRBMpNxj77aUn4cffY+b/A4kk
 sR5gjsiHoesdAMEpHIXSdEcYMIstIg==
 =VEKB
 -----END PGP SIGNATURE-----

Merge tag 'gfs2-for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Pull gfs2 updates from Bob Peterson:
 "We've got a lot of patches (39) for this merge window. Most of these
  patches are related to corruption that occurs when journals are
  replayed. For example:

   1. A node fails while writing to the file system.
   2. Other nodes use the metadata that was once used by the failed
      node.
   3. When the node returns to the cluster, its journal is replayed, but
      the older metadata blocks overwrite the changes from step 2.

  Summary:

   - Fixed the recovery sequence to prevent corruption during journal
     replay.

   - Many bug fixes found during recovery testing.

   - New improved file system withdraw sequence.

   - Fixed how resource group buffers are managed.

   - Fixed how metadata revokes are tracked and written.

   - Improve processing of IO errors hit by daemons like logd and
     quotad.

   - Improved error checking in metadata writes.

   - Fixed how qadata quota data structures are managed"

* tag 'gfs2-for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: (39 commits)
  gfs2: Fix oversight in gfs2_ail1_flush
  gfs2: change from write to read lock for sd_log_flush_lock in journal replay
  gfs2: instrumentation wrt ail1 stuck
  gfs2: don't lock sd_log_flush_lock in try_rgrp_unlink
  gfs2: Remove unnecessary gfs2_qa_{get,put} pairs
  gfs2: Split gfs2_rsqa_delete into gfs2_rs_delete and gfs2_qa_put
  gfs2: Change inode qa_data to allow multiple users
  gfs2: eliminate gfs2_rsqa_alloc in favor of gfs2_qa_alloc
  gfs2: Switch to list_{first,last}_entry
  gfs2: Clean up inode initialization and teardown
  gfs2: Additional information when gfs2_ail1_flush withdraws
  gfs2: leaf_dealloc needs to allocate one more revoke
  gfs2: allow journal replay to hold sd_log_flush_lock
  gfs2: don't allow releasepage to free bd still used for revokes
  gfs2: flesh out delayed withdraw for gfs2_log_flush
  gfs2: Do proper error checking for go_sync family of glops functions
  gfs2: Don't demote a glock until its revokes are written
  gfs2: drain the ail2 list after io errors
  gfs2: Withdraw in gfs2_ail1_flush if write_cache_pages fails
  gfs2: Do log_flush in gfs2_ail_empty_gl even if ail list is empty
  ...
2020-03-31 14:16:03 -07:00
Bob Peterson
7d9f924958 gfs2: Add verbose option to check_journal_clean
Before this patch, function check_journal_clean would give messages
related to journal recovery. That's fine for mount time, but when a
node withdraws and forces replay that way, we don't want all those
distracting and misleading messages. This patch adds a new parameter
to make those messages optional.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-02-27 07:53:17 -06:00
Bob Peterson
601ef0d52e gfs2: Force withdraw to replay journals and wait for it to finish
When a node withdraws from a file system, it often leaves its journal
in an incomplete state. This is especially true when the withdraw is
caused by io errors writing to the journal. Before this patch, a
withdraw would try to write a "shutdown" record to the journal, tell
dlm it's done with the file system, and none of the other nodes
know about the problem. Later, when the problem is fixed and the
withdrawn node is rebooted, it would then discover that its own
journal was incomplete, and replay it. However, replaying it at this
point is almost guaranteed to introduce corruption because the other
nodes are likely to have used affected resource groups that appeared
in the journal since the time of the withdraw. Replaying the journal
later will overwrite any changes made, and not through any fault of
dlm, which was instructed during the withdraw to release those
resources.

This patch makes file system withdraws seen by the entire cluster.
Withdrawing nodes dequeue their journal glock to allow recovery.

The remaining nodes check all the journals to see if they are
clean or in need of replay. They try to replay dirty journals, but
only the journals of withdrawn nodes will be "not busy" and
therefore available for replay.

Until the journal replay is complete, no i/o related glocks may be
given out, to ensure that the replay does not cause the
aforementioned corruption: We cannot allow any journal replay to
overwrite blocks associated with a glock once it is held.

The "live" glock which is now used to signal when a withdraw
occurs. When a withdraw occurs, the node signals its withdraw by
dequeueing the "live" glock and trying to enqueue it in EX mode,
thus forcing the other nodes to all see a demote request, by way
of a "1CB" (one callback) try lock. The "live" glock is not
granted in EX; the callback is only just used to indicate a
withdraw has occurred.

Note that all nodes in the cluster must wait for the recovering
node to finish replaying the withdrawing node's journal before
continuing. To this end, it checks that the journals are clean
multiple times in a retry loop.

Also note that the withdraw function may be called from a wide
variety of situations, and therefore, we need to take extra
precautions to make sure pointers are valid before using them in
many circumstances.

We also need to take care when glocks decide to withdraw, since
the withdraw code now uses glocks.

Also, before this patch, if a process encountered an error and
decided to withdraw, if another process was already withdrawing,
the second withdraw would be silently ignored, which set it free
to unlock its glocks. That's correct behavior if the original
withdrawer encounters further errors down the road. But if
secondary waiters don't wait for the journal replay, unlocking
glocks will allow other nodes to use them, despite the fact that
the journal containing those blocks is being replayed. The
replay needs to finish before our glocks are released to other
nodes. IOW, secondary withdraws need to wait for the first
withdraw to finish.

For example, if an rgrp glock is unlocked by a process that didn't
wait for the first withdraw, a journal replay could introduce file
system corruption by replaying a rgrp block that has already been
granted to a different cluster node.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2020-02-27 07:53:12 -06:00
Bob Peterson
a72d2401f5 gfs2: Allow some glocks to be used during withdraw
We need to allow some glocks to be enqueued, dequeued, promoted, and demoted
when we're withdrawn. For example, to maintain metadata integrity, we should
disallow the use of inode and rgrp glocks when withdrawn. Other glocks, like
iopen or the transaction glocks may be safely used because none of their
metadata goes through the journal. So in general, we should disallow all
glocks with an address space, and allow all the others. One exception is:
we need to allow our active journal to be demoted so others may recover it.

Allowing glocks after withdraw gives us the ability to take appropriate
action (in a following patch) to have our journal properly replayed by
another node rather than just abandoning the current transactions and
pretending nothing bad happened, leaving the other nodes free to modify
the blocks we had in our journal, which may result in file system
corruption.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2020-02-20 11:01:36 -06:00
Bob Peterson
0d91061a37 gfs2: move check_journal_clean to util.c for future use
Before this patch function check_journal_clean was in ops_fstype.c.
This patch moves it to util.c so we can make use of it elsewhere
in a future patch.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-02-10 07:39:51 -06:00
Linus Torvalds
c9d35ee049 Merge branch 'merge.nfs-fs_parse.1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs file system parameter updates from Al Viro:
 "Saner fs_parser.c guts and data structures. The system-wide registry
  of syntax types (string/enum/int32/oct32/.../etc.) is gone and so is
  the horror switch() in fs_parse() that would have to grow another case
  every time something got added to that system-wide registry.

  New syntax types can be added by filesystems easily now, and their
  namespace is that of functions - not of system-wide enum members. IOW,
  they can be shared or kept private and if some turn out to be widely
  useful, we can make them common library helpers, etc., without having
  to do anything whatsoever to fs_parse() itself.

  And we already get that kind of requests - the thing that finally
  pushed me into doing that was "oh, and let's add one for timeouts -
  things like 15s or 2h". If some filesystem really wants that, let them
  do it. Without somebody having to play gatekeeper for the variants
  blessed by direct support in fs_parse(), TYVM.

  Quite a bit of boilerplate is gone. And IMO the data structures make a
  lot more sense now. -200LoC, while we are at it"

* 'merge.nfs-fs_parse.1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (25 commits)
  tmpfs: switch to use of invalfc()
  cgroup1: switch to use of errorfc() et.al.
  procfs: switch to use of invalfc()
  hugetlbfs: switch to use of invalfc()
  cramfs: switch to use of errofc() et.al.
  gfs2: switch to use of errorfc() et.al.
  fuse: switch to use errorfc() et.al.
  ceph: use errorfc() and friends instead of spelling the prefix out
  prefix-handling analogues of errorf() and friends
  turn fs_param_is_... into functions
  fs_parse: handle optional arguments sanely
  fs_parse: fold fs_parameter_desc/fs_parameter_spec
  fs_parser: remove fs_parameter_description name field
  add prefix to fs_context->log
  ceph_parse_param(), ceph_parse_mon_ips(): switch to passing fc_log
  new primitive: __fs_parse()
  switch rbd and libceph to p_log-based primitives
  struct p_log, variants of warnf() et.al. taking that one instead
  teach logfc() to handle prefices, give it saner calling conventions
  get rid of cg_invalf()
  ...
2020-02-08 13:26:41 -08:00
Al Viro
77cb271e6a gfs2: switch to use of errorfc() et.al.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-07 14:48:41 -05:00
Al Viro
48ce73b1be fs_parse: handle optional arguments sanely
Don't bother with "mixed" options that would allow both the
form with and without argument (i.e. both -o foo and -o foo=bar).
Rather than trying to shove both into a single fs_parameter_spec,
allow having with-argument and no-argument specs with the same
name and teach fs_parse to handle that.

There are very few options of that sort, and they are actually
easier to handle that way - callers end up with less postprocessing.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-07 14:48:37 -05:00
Al Viro
d7167b1499 fs_parse: fold fs_parameter_desc/fs_parameter_spec
The former contains nothing but a pointer to an array of the latter...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-07 14:48:37 -05:00
Eric Sandeen
96cafb9ccb fs_parser: remove fs_parameter_description name field
Unused now.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-07 14:48:36 -05:00
Al Viro
5eede62529 fold struct fs_parameter_enum into struct constant_table
no real difference now

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-07 00:12:50 -05:00
Al Viro
2710c957a8 fs_parse: get rid of ->enums
Don't do a single array; attach them to fsparam_enum() entry
instead.  And don't bother trying to embed the names into those -
it actually loses memory, with no real speedup worth mentioning.

Simplifies validation as well.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-07 00:12:50 -05:00
Bob Peterson
2e9eeaa117 gfs2: eliminate ssize parameter from gfs2_struct2blk
Every caller of function gfs2_struct2blk specified sizeof(u64).

This patch eliminates the unnecessary parameter and replaces the
size calculation with a new superblock variable that is computed
to be the maximum number of block pointers we can fit inside a
log descriptor, as is done for pointers per dinode and indirect
block.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-01-07 18:46:06 +01:00
Bob Peterson
eb43e660c0 gfs2: Introduce function gfs2_withdrawn
Add function gfs2_withdrawn and replace all checks for the SDF_WITHDRAWN
bit to call it. This does not change the logic or function of gfs2, and
it facilitates later improvements to the withdraw sequence.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-11-14 19:46:18 +01:00
Ben Dooks (Codethink)
1a48049adb gfs2: make gfs2_fs_parameters static
The gfs2_fs_parameters is not used outside the unit
it is declared in, so make it static.

Fixes the following sparse warning:

fs/gfs2/ops_fstype.c:1331:39: warning: symbol 'gfs2_fs_parameters' was not declared. Should it be static?

Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
Reviewed-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-10-30 12:17:04 +01:00
Andrew Price
d5798141fd gfs2: Fix initialisation of args for remount
When gfs2 was converted to use fs_context, the initialisation of the
mount args structure to the currently active args was lost with the
removal of gfs2_remount_fs(), so the checks of the new args on remount
became checks against the default values instead of the current ones.
This caused unexpected remount behaviour and test failures (xfstests
generic/294, generic/306 and generic/452).

Reinstate the args initialisation, this time in gfs2_init_fs_context()
and conditional upon fc->purpose, as that's the only time we get control
before the mount args are parsed in the remount process.

Fixes: 1f52aa08d1 ("gfs2: Convert gfs2 to fs_context")
Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-10-30 12:16:53 +01:00
Andrew Price
30aecae86e gfs2: Fix memory leak when gfs2meta's fs_context is freed
gfs2 and gfs2meta share an ->init_fs_context function which allocates an
args structure stored in fc->fs_private. gfs2 registers a ->free
function to free this memory when the fs_context is cleaned up, but
there was not one registered for gfs2meta, causing a leak.

Register a ->free function for gfs2meta. The existing gfs2_fc_free
function does what we need.

Reported-by: syzbot+c2fdfd2b783754878fb6@syzkaller.appspotmail.com
Fixes: 1f52aa08d1 ("gfs2: Convert gfs2 to fs_context")
Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-10-24 16:20:43 +02:00
Linus Torvalds
0b36c9eed2 Merge branch 'work.mount3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull more mount API conversions from Al Viro:
 "Assorted conversions of options parsing to new API.

  gfs2 is probably the most serious one here; the rest is trivial stuff.

  Other things in what used to be #work.mount are going to wait for the
  next cycle (and preferably go via git trees of the filesystems
  involved)"

* 'work.mount3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  gfs2: Convert gfs2 to fs_context
  vfs: Convert spufs to use the new mount API
  vfs: Convert hypfs to use the new mount API
  hypfs: Fix error number left in struct pointer member
  vfs: Convert functionfs to use the new mount API
  vfs: Convert bpf to use the new mount API
2019-09-24 12:33:34 -07:00
Andrew Price
1f52aa08d1 gfs2: Convert gfs2 to fs_context
Convert gfs2 and gfs2meta to fs_context. Removes the duplicated vfs code
from gfs2_mount and instead uses the new vfs_get_block_super() before
switching the ->root to the appropriate dentry.

The mount option parsing has been converted to the new API and error
reporting for invalid options has been made more precise at the same
time.

All of the mount/remount code has been moved into ops_fstype.c

Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: cluster-devel@redhat.com
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-09-18 22:47:05 -04:00
Bob Peterson
ad26967b9a gfs2: Use async glocks for rename
Because s_vfs_rename_mutex is not cluster-wide, multiple nodes can
reverse the roles of which directories are "old" and which are "new" for
the purposes of rename. This can cause deadlocks where two nodes end up
waiting for each other.

There can be several layers of directory dependencies across many nodes.

This patch fixes the problem by acquiring all gfs2_rename's inode glocks
asychronously and waiting for all glocks to be acquired.  That way all
inodes are locked regardless of the order.

The timeout value for multiple asynchronous glocks is calculated to be
the total of the individual wait times for each glock times two.

Since gfs2_exchange is very similar to gfs2_rename, both functions are
patched in the same way.

A new async glock wait queue, sd_async_glock_wait, keeps a list of
waiters for these events. If gfs2's holder_wake function detects an
async holder, it wakes up any waiters for the event. The waiter only
tests whether any of its requests are still pending.

Since the glocks are sent to dlm asychronously, the wait function needs
to check to see which glocks, if any, were granted.

If a glock is granted by dlm (and therefore held), its minimum hold time
is checked and adjusted as necessary, as other glock grants do.

If the event times out, all glocks held thus far must be dequeued to
resolve any existing deadlocks.  Then, if there are any outstanding
locking requests, we need to loop around and wait for dlm to respond to
those requests too.  After we release all requests, we return -ESTALE to
the caller (vfs rename) which loops around and retries the request.

    Node1           Node2
    ---------       ---------
1.  Enqueue A       Enqueue B
2.  Enqueue B       Enqueue A
3.  A granted
6.                  B granted
7.  Wait for B
8.                  Wait for A
9.                  A times out (since Node 1 holds A)
10.                 Dequeue B (since it was granted)
11.                 Wait for all requests from DLM
12. B Granted (since Node2 released it in step 10)
13. Rename
14. Dequeue A
15.                 DLM Grants A
16.                 Dequeue A (due to the timeout and since we
                    no longer have B held for our task).
17. Dequeue B
18.                 Return -ESTALE to vfs
19.                 VFS retries the operation, goto step 1.

This release-all-locks / acquire-all-locks may slow rename / exchange
down as both nodes struggle in the same way and do the same thing.
However, this will only happen when there is contention for the same
inodes, which ought to be rare.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-09-04 20:22:17 +02:00
Bob Peterson
04aea0ca14 gfs2: Rename SDF_SHUTDOWN to SDF_WITHDRAWN
Before this patch, the superblock flag indicating when a file system
is withdrawn was called SDF_SHUTDOWN. This patch simply renames it to
the more obvious SDF_WITHDRAWN.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-06-27 21:26:35 +02:00
Kefeng Wang
15a798f7de gfs2: Use IS_ERR_OR_NULL
Use IS_ERR_OR_NULL where appropriate.

(Several more places converted by Andreas.)

Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-06-27 20:53:46 +02:00
Andreas Gruenbacher
2a27b755ed gfs2: Clean up freeing struct gfs2_sbd
Add a free_sbd function for freeing a struct gfs2_sbd.  Use that for
freeing a super-block descriptor, either directly or via kobject_put.
Free sd_lkstats inside the kobject release function: that way,
gfs2_put_super will no longer leak sd_lkstats.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-06-27 20:53:45 +02:00
Thomas Gleixner
7336d0e654 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 398
Based on 1 normalized pattern(s):

  this copyrighted material is made available to anyone wishing to use
  modify copy or redistribute it subject to the terms and conditions
  of the gnu general public license version 2

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 44 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190531081038.653000175@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-05 17:37:12 +02:00
Abhi Das
f4686c26ec gfs2: read journal in large chunks
Use bios to read in the journal into the address space of the journal inode
(jd_inode), sequentially and in large chunks.  This is faster for locating the
journal head that the previous binary search approach.  When performing
recovery, we keep the journal in the address space until recovery is done,
which further speeds up things.

Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-05-07 23:39:15 +02:00
Andreas Gruenbacher
a5b1d3fc50 gfs2: Rename sd_log_le_{revoke,ordered}
Rename sd_log_le_revoke to sd_log_revokes and sd_log_le_ordered to
sd_log_ordered: not sure what le stands for here, but it doesn't add
clarity, and if it stands for list entry, it's actually confusing as
those are both list heads but not list entries.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-05-07 23:39:14 +02:00
Bob Peterson
23e93c9b2c Revert "gfs2: read journal in large chunks to locate the head"
This reverts commit 2a5f14f279.

This patch causes xfstests generic/311 to fail. Reverting this for
now until we have a proper fix.

Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-14 09:52:51 -08:00
Abhi Das
2a5f14f279 gfs2: read journal in large chunks to locate the head
Use bio(s) to read in the journal sequentially in large chunks and
locate the head of the journal.

This version addresses the issues Christoph pointed out w.r.t error handling
and using deprecated API.

Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
2018-12-11 17:50:36 +01:00
Linus Torvalds
bfd93a87ea We've got 18 patches for this merge window, none of which are very major.
1. Andreas Gruenbacher contributed several patches to clean up the gfs2
    block allocator to prepare for future performance enhancements.
 2. Andy Price contributed a patch to fix a use-after-free problem.
 3. I contributed some patches that fix gfs2's broken rgrplvb mount option.
 4. I contributed some cleanup patches and error message improvements.
 5. Steve Whitehouse and Abhi Das sent a patch to enable getlabel support.
 6. Tim Smith contributed a patch to flush the glock delete workqueue at exit.
 -----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJbz3QlAAoJENeLYdPf93o7+VUH/0XUfbyNQsU6fyLP8NrZq05z
 qNsVN3Hm+tPc0/V0C75lSp9ej7B8ogMl0RPysniziRTEWIDK6oGB/JUGvIH0O/z1
 vZ/sofZEDXthV3YjiI8RXLcaJLsOavSXnGwHbNKohM2PdRObVkZbaUL+xWlL9X3q
 yHgP5AHCIrpVzz5l4sLO6N0Npnl0aNRTBxPIyDTaBBmitXkvtqkCbkw185jzkDDs
 fMvZ6I+3UbUxp99InFTHeUXvTr1EbvfPrhZzmppuV1N4LLSa1eRaWmsKTEDPdnsy
 uhsh9ittv8EJXN2dpmZIOdGmDEK07kFoZrsbM5F78sOH/LbUyJ5YfBN02lDaaAI=
 =b7zk
 -----END PGP SIGNATURE-----

Merge tag 'gfs2-4.20.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Pull gfs2 updates from Bob Peterson:
 "We've got 18 patches for this merge window, none of which are very
  major:

   - clean up the gfs2 block allocator to prepare for future performance
     enhancements (Andreas Gruenbacher)

   - fix a use-after-free problem (Andy Price)

   - patches that fix gfs2's broken rgrplvb mount option (me)

   - cleanup patches and error message improvements (me)

   - enable getlabel support (Steve Whitehouse and Abhi Das)

   - flush the glock delete workqueue at exit (Tim Smith)"

* tag 'gfs2-4.20.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  gfs2: Fix minor typo: couln't versus couldn't.
  gfs2: write revokes should traverse sd_ail1_list in reverse
  gfs2: Pass resource group to rgblk_free
  gfs2: Remove unnecessary gfs2_rlist_alloc parameter
  gfs2: Fix marking bitmaps non-full
  gfs2: Fix some minor typos
  gfs2: Rename bitmap.bi_{len => bytes}
  gfs2: Remove unused RGRP_RSRV_MINBYTES definition
  gfs2: Move rs_{sizehint, rgd_gh} fields into the inode
  gfs2: Clean up out-of-bounds check in gfs2_rbm_from_block
  gfs2: Always check the result of gfs2_rbm_from_block
  gfs2: getlabel support
  GFS2: Flush the GFS2 delete workqueue before stopping the kernel threads
  gfs2: Don't leave s_fs_info pointing to freed memory in init_sbd
  gfs2: Use fs_* functions instead of pr_* function where we can
  gfs2: slow the deluge of io error messages
  gfs2: Don't set GFS2_RDF_UPTODATE when the lvb is updated
  gfs2: improve debug information when lvb mismatches are found
2018-10-24 17:30:39 +01:00
Al Viro
3df629d873 gfs2_meta: ->mount() can get NULL dev_name
get in sync with mount_bdev() handling of the same

Reported-by: syzbot+c54f8e94e6bba03b04e9@syzkaller.appspotmail.com
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-10-13 00:19:13 -04:00
Andrew Price
4c62bd9cea gfs2: Don't leave s_fs_info pointing to freed memory in init_sbd
When alloc_percpu() fails, sdp gets freed but sb->s_fs_info still points
to the same address. Move the assignment after that error check so that
s_fs_info can only point to a valid sdp or NULL, which is checked for
later in the error path, in gfs2_kill_super().

Reported-by: syzbot+dcb8b3587445007f5808@syzkaller.appspotmail.com
Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-10-08 07:52:43 -05:00
Andreas Gruenbacher
9a38662ba4 gfs2: Remove sdp->sd_jheightsize
GFS2 keeps two arrarys in the superblock that define the maximum size of
an inode depending on the inode's height: sdp->sd_heightsize defines the
heights in units of sb->s_blocksize; sdp->sd_jheightsize defines them in
units of sb->s_blocksize - sizeof(struct gfs2_meta_header).  These
arrays are used to determine when additional layers of indirect blocks
are needed.  The second array is used for directories which have an
additional gfs2_meta_header at the beginning of each block.

Distinguishing between these two cases makes no sense: the height
required for representing N blocks will come out the same no matter if
the calculation is done in gross (sb->s_blocksize) or net
(sb->s_blocksize - sizeof(struct gfs2_meta_header)) units.

Stuffed directories don't have an additional gfs2_meta_header, but the
stuffed case is handled separately for both files and directories,
anyway.

Remove the unncessary sdp->sd_jheightsize array.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-04-16 09:25:21 -07:00
Bob Peterson
3e7aafc39c GFS2: Minor improvements to comments and documentation
This patch simply fixes some comments and the gfs2-glocks.txt file:
Places where i_rwsem was called i_mutex, and adding i_rw_mutex.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-04-12 10:07:51 -07:00
Bob Peterson
805c090750 GFS2: Log the reason for log flushes in every log header
This patch just adds the capability for GFS2 to track which function
called gfs2_log_flush. This should make it easier to diagnose
problems based on the sequence of events found in the journals.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
2018-01-23 07:39:20 -07:00
Bob Peterson
c1696fb85d GFS2: Introduce new gfs2_log_header_v2
This patch adds a new structure called gfs2_log_header_v2 which is used
to store expanded fields into previously unused areas of the log headers
(i.e., this change is backwards compatible).  Some of these are used for
debug purposes so we can backtrack when problems occur.  Others are
reserved for future expansion.

This patch is based on a prototype from Steve Whitehouse.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2018-01-23 07:38:53 -07:00
Linus Torvalds
1751e8a6cb Rename superblock flags (MS_xyz -> SB_xyz)
This is a pure automated search-and-replace of the internal kernel
superblock flags.

The s_flags are now called SB_*, with the names and the values for the
moment mirroring the MS_* flags that they're equivalent to.

Note how the MS_xyz flags are the ones passed to the mount system call,
while the SB_xyz flags are what we then use in sb->s_flags.

The script to do this was:

    # places to look in; re security/*: it generally should *not* be
    # touched (that stuff parses mount(2) arguments directly), but
    # there are two places where we really deal with superblock flags.
    FILES="drivers/mtd drivers/staging/lustre fs ipc mm \
            include/linux/fs.h include/uapi/linux/bfs_fs.h \
            security/apparmor/apparmorfs.c security/apparmor/include/lib.h"
    # the list of MS_... constants
    SYMS="RDONLY NOSUID NODEV NOEXEC SYNCHRONOUS REMOUNT MANDLOCK \
          DIRSYNC NOATIME NODIRATIME BIND MOVE REC VERBOSE SILENT \
          POSIXACL UNBINDABLE PRIVATE SLAVE SHARED RELATIME KERNMOUNT \
          I_VERSION STRICTATIME LAZYTIME SUBMOUNT NOREMOTELOCK NOSEC BORN \
          ACTIVE NOUSER"

    SED_PROG=
    for i in $SYMS; do SED_PROG="$SED_PROG -e s/MS_$i/SB_$i/g"; done

    # we want files that contain at least one of MS_...,
    # with fs/namespace.c and fs/pnode.c excluded.
    L=$(for i in $SYMS; do git grep -w -l MS_$i $FILES; done| sort|uniq|grep -v '^fs/namespace.c'|grep -v '^fs/pnode.c')

    for f in $L; do sed -i $f $SED_PROG; done

Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-27 13:05:09 -08:00
Linus Torvalds
0f0d12728e Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull mount flag updates from Al Viro:
 "Another chunk of fmount preparations from dhowells; only trivial
  conflicts for that part. It separates MS_... bits (very grotty
  mount(2) ABI) from the struct super_block ->s_flags (kernel-internal,
  only a small subset of MS_... stuff).

  This does *not* convert the filesystems to new constants; only the
  infrastructure is done here. The next step in that series is where the
  conflicts would be; that's the conversion of filesystems. It's purely
  mechanical and it's better done after the merge, so if you could run
  something like

	list=$(for i in MS_RDONLY MS_NOSUID MS_NODEV MS_NOEXEC MS_SYNCHRONOUS MS_MANDLOCK MS_DIRSYNC MS_NOATIME MS_NODIRATIME MS_SILENT MS_POSIXACL MS_KERNMOUNT MS_I_VERSION MS_LAZYTIME; do git grep -l $i fs drivers/staging/lustre drivers/mtd ipc mm include/linux; done|sort|uniq|grep -v '^fs/namespace.c$')

	sed -i -e 's/\<MS_RDONLY\>/SB_RDONLY/g' \
	        -e 's/\<MS_NOSUID\>/SB_NOSUID/g' \
	        -e 's/\<MS_NODEV\>/SB_NODEV/g' \
	        -e 's/\<MS_NOEXEC\>/SB_NOEXEC/g' \
	        -e 's/\<MS_SYNCHRONOUS\>/SB_SYNCHRONOUS/g' \
	        -e 's/\<MS_MANDLOCK\>/SB_MANDLOCK/g' \
	        -e 's/\<MS_DIRSYNC\>/SB_DIRSYNC/g' \
	        -e 's/\<MS_NOATIME\>/SB_NOATIME/g' \
	        -e 's/\<MS_NODIRATIME\>/SB_NODIRATIME/g' \
	        -e 's/\<MS_SILENT\>/SB_SILENT/g' \
	        -e 's/\<MS_POSIXACL\>/SB_POSIXACL/g' \
	        -e 's/\<MS_KERNMOUNT\>/SB_KERNMOUNT/g' \
	        -e 's/\<MS_I_VERSION\>/SB_I_VERSION/g' \
	        -e 's/\<MS_LAZYTIME\>/SB_LAZYTIME/g' \
	        $list

  and commit it with something along the lines of 'convert filesystems
  away from use of MS_... constants' as commit message, it would save a
  quite a bit of headache next cycle"

* 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  VFS: Differentiate mount flags (MS_*) from internal superblock flags
  VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb)
  vfs: Add sb_rdonly(sb) to query the MS_RDONLY flag on s_flags
2017-09-14 18:54:01 -07:00
Linus Torvalds
a0725ab0c7 Merge branch 'for-4.14/block' of git://git.kernel.dk/linux-block
Pull block layer updates from Jens Axboe:
 "This is the first pull request for 4.14, containing most of the code
  changes. It's a quiet series this round, which I think we needed after
  the churn of the last few series. This contains:

   - Fix for a registration race in loop, from Anton Volkov.

   - Overflow complaint fix from Arnd for DAC960.

   - Series of drbd changes from the usual suspects.

   - Conversion of the stec/skd driver to blk-mq. From Bart.

   - A few BFQ improvements/fixes from Paolo.

   - CFQ improvement from Ritesh, allowing idling for group idle.

   - A few fixes found by Dan's smatch, courtesy of Dan.

   - A warning fixup for a race between changing the IO scheduler and
     device remova. From David Jeffery.

   - A few nbd fixes from Josef.

   - Support for cgroup info in blktrace, from Shaohua.

   - Also from Shaohua, new features in the null_blk driver to allow it
     to actually hold data, among other things.

   - Various corner cases and error handling fixes from Weiping Zhang.

   - Improvements to the IO stats tracking for blk-mq from me. Can
     drastically improve performance for fast devices and/or big
     machines.

   - Series from Christoph removing bi_bdev as being needed for IO
     submission, in preparation for nvme multipathing code.

   - Series from Bart, including various cleanups and fixes for switch
     fall through case complaints"

* 'for-4.14/block' of git://git.kernel.dk/linux-block: (162 commits)
  kernfs: checking for IS_ERR() instead of NULL
  drbd: remove BIOSET_NEED_RESCUER flag from drbd_{md_,}io_bio_set
  drbd: Fix allyesconfig build, fix recent commit
  drbd: switch from kmalloc() to kmalloc_array()
  drbd: abort drbd_start_resync if there is no connection
  drbd: move global variables to drbd namespace and make some static
  drbd: rename "usermode_helper" to "drbd_usermode_helper"
  drbd: fix race between handshake and admin disconnect/down
  drbd: fix potential deadlock when trying to detach during handshake
  drbd: A single dot should be put into a sequence.
  drbd: fix rmmod cleanup, remove _all_ debugfs entries
  drbd: Use setup_timer() instead of init_timer() to simplify the code.
  drbd: fix potential get_ldev/put_ldev refcount imbalance during attach
  drbd: new disk-option disable-write-same
  drbd: Fix resource role for newly created resources in events2
  drbd: mark symbols static where possible
  drbd: Send P_NEG_ACK upon write error in protocol != C
  drbd: add explicit plugging when submitting batches
  drbd: change list_for_each_safe to while(list_first_entry_or_null)
  drbd: introduce drbd_recv_header_maybe_unplug
  ...
2017-09-07 11:59:42 -07:00
Andreas Gruenbacher
561b796987 gfs2: Silence gcc format-truncation warning
Enlarge sd_fsname to be big enough for the longest long lock table name
and an arbitrary journal number.  This silences two -Wformat-truncation
warnings with gcc 7.1.1.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2017-08-25 10:59:21 -05:00
Christoph Hellwig
74d46992e0 block: replace bi_bdev with a gendisk pointer and partitions index
This way we don't need a block_device structure to submit I/O.  The
block_device has different life time rules from the gendisk and
request_queue and is usually only available when the block device node
is open.  Other callers need to explicitly create one (e.g. the lightnvm
passthrough code, or the new nvme multipathing code).

For the actual I/O path all that we need is the gendisk, which exists
once per block device.  But given that the block layer also does
partition remapping we additionally need a partition index, which is
used for said remapping in generic_make_request.

Note that all the block drivers generally want request_queue or
sometimes the gendisk, so this removes a layer of indirection all
over the stack.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-23 12:49:55 -06:00
Bob Peterson
b2fb7dab7f GFS2: Delete debugfs files only after we evict the glocks
This patch moves the call to gfs2_delete_debugfs_file so that it
comes after the glock hash table has been cleared. This way we
can query the debugfs files if umount hangs.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2017-08-09 09:36:39 -05:00
David Howells
bc98a42c1f VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb)
Firstly by applying the following with coccinelle's spatch:

	@@ expression SB; @@
	-SB->s_flags & MS_RDONLY
	+sb_rdonly(SB)

to effect the conversion to sb_rdonly(sb), then by applying:

	@@ expression A, SB; @@
	(
	-(!sb_rdonly(SB)) && A
	+!sb_rdonly(SB) && A
	|
	-A != (sb_rdonly(SB))
	+A != sb_rdonly(SB)
	|
	-A == (sb_rdonly(SB))
	+A == sb_rdonly(SB)
	|
	-!(sb_rdonly(SB))
	+!sb_rdonly(SB)
	|
	-A && (sb_rdonly(SB))
	+A && sb_rdonly(SB)
	|
	-A || (sb_rdonly(SB))
	+A || sb_rdonly(SB)
	|
	-(sb_rdonly(SB)) != A
	+sb_rdonly(SB) != A
	|
	-(sb_rdonly(SB)) == A
	+sb_rdonly(SB) == A
	|
	-(sb_rdonly(SB)) && A
	+sb_rdonly(SB) && A
	|
	-(sb_rdonly(SB)) || A
	+sb_rdonly(SB) || A
	)

	@@ expression A, B, SB; @@
	(
	-(sb_rdonly(SB)) ? 1 : 0
	+sb_rdonly(SB)
	|
	-(sb_rdonly(SB)) ? A : B
	+sb_rdonly(SB) ? A : B
	)

to remove left over excess bracketage and finally by applying:

	@@ expression A, SB; @@
	(
	-(A & MS_RDONLY) != sb_rdonly(SB)
	+(bool)(A & MS_RDONLY) != sb_rdonly(SB)
	|
	-(A & MS_RDONLY) == sb_rdonly(SB)
	+(bool)(A & MS_RDONLY) == sb_rdonly(SB)
	)

to make comparisons against the result of sb_rdonly() (which is a bool)
work correctly.

Signed-off-by: David Howells <dhowells@redhat.com>
2017-07-17 08:45:34 +01:00
Christoph Hellwig
fdd050b5b3 Merge branch 'uuid-types' of bombadil.infradead.org:public_git/uuid into nvme-base 2017-06-13 11:45:14 +02:00
Christoph Hellwig
4e4cbee93d block: switch bios to blk_status_t
Replace bi_error with a new bi_status to allow for a clear conversion.
Note that device mapper overloaded bi_error with a private value, which
we'll have to keep arround at least for now and thus propagate to a
proper blk_status_t value.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-06-09 09:27:32 -06:00
Christoph Hellwig
85787090a2 fs: switch ->s_uuid to uuid_t
For some file systems we still memcpy into it, but in various places this
already allows us to use the proper uuid helpers.  More to come..

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Acked-by: Mimi Zohar <zohar@linux.vnet.ibm.com> (Changes to IMA/EVM)
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
2017-06-05 16:59:12 +02:00
Jan Kara
c1844d536d fs: Remove SB_I_DYNBDI flag
Now that all bdi structures filesystems use are properly refcounted, we
can remove the SB_I_DYNBDI flag.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-20 12:09:55 -06:00
Jan Kara
95fe66de9f gfs2: Convert to properly refcounting bdi
Similarly to set_bdev_super() GFS2 just used block device reference to
bdi. Convert it to properly getting bdi reference. The reference will
get automatically dropped on superblock destruction.

CC: Steven Whitehouse <swhiteho@redhat.com>
CC: Bob Peterson <rpeterso@redhat.com>
CC: cluster-devel@redhat.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-20 12:09:55 -06:00