Commit graph

62483 commits

Author SHA1 Message Date
Andreas Gruenbacher
4bd684bc01 gfs2: Remove unnecessary gfs2_qa_{get,put} pairs
We now get the quota data structure when opening a file writable and put it
when closing that writable file descriptor, so there no longer is a need for
gfs2_qa_{get,put} while we're holding a writable file descriptor.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2020-03-27 14:08:05 -05:00
Andreas Gruenbacher
1595548fe7 gfs2: Split gfs2_rsqa_delete into gfs2_rs_delete and gfs2_qa_put
Keeping reservations and quotas separate helps reviewing the code.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2020-03-27 14:08:04 -05:00
Bob Peterson
2fba46a04c gfs2: Change inode qa_data to allow multiple users
Before this patch, multiple users called gfs2_qa_alloc which allocated
a qadata structure to the inode, if quotas are turned on. Later, in
file close or evict, the structure was deleted with gfs2_qa_delete.
But there can be several competing processes who need access to the
structure. There were races between file close (release) and the others.
Thus, a release could delete the structure out from under a process
that relied upon its existence. For example, chown.

This patch changes the management of the qadata structures to be
a get/put scheme. Function gfs2_qa_alloc has been changed to gfs2_qa_get
and if the structure is allocated, the count essentially starts out at
1. Function gfs2_qa_delete has been renamed to gfs2_qa_put, and the
last guy to decrement the count to 0 frees the memory.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2020-03-27 14:08:04 -05:00
Bob Peterson
d580712a37 gfs2: eliminate gfs2_rsqa_alloc in favor of gfs2_qa_alloc
Before this patch, multiple callers called gfs2_rsqa_alloc to force
the existence of a reservations structure and a quota data structure
if needed. However, now the reservations are handled separately, so
the quota data is only the quota data. So we eliminate the one in
favor of just calling gfs2_qa_alloc directly.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2020-03-27 14:08:04 -05:00
Andreas Gruenbacher
969183bc68 gfs2: Switch to list_{first,last}_entry
Replace open-coded versions of list_first_entry and list_last_entry with those
functions.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2020-03-27 14:08:04 -05:00
Andreas Gruenbacher
40e7e86ef1 gfs2: Clean up inode initialization and teardown
When allocating a new inode, mark the iopen glock holder as uninitialized to
make sure gfs2_evict_inode won't fail after an incomplete create or lookup.  In
gfs2_evict_inode, allow the inode glock to be NULL and remove the duplicate
iopen glock teardown code.  In gfs2_inode_lookup, don't tear down things that
gfs2_evict_inode will already tear down.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2020-03-27 14:08:04 -05:00
Bob Peterson
490031281d gfs2: Additional information when gfs2_ail1_flush withdraws
Before this patch, if gfs2_ail1_flush gets an error from function
gfs2_ail1_start_one (which comes indirectly from generic_writepages)
the file system is withdrawn, but without any explanation why.

This patch adds an error message if gfs2_ail1_flush gets an error
from gfs2_ail1_start_one.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2020-03-06 10:15:03 -06:00
Bob Peterson
cc44457f16 gfs2: leaf_dealloc needs to allocate one more revoke
Function leaf_dealloc was not allocating enough journal space for
revokes. Before, it allocated 'l_blocks' revokes, but it needs one
more for the revoke of the dinode that is modified. This patch adds
the needed revoke entry to function leaf_dealloc.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2020-02-27 07:53:18 -06:00
Bob Peterson
c9ebc4b737 gfs2: allow journal replay to hold sd_log_flush_lock
Before this patch, journal replays could stomp on log flushes
and each other because both log flushes and journal replays used
the same sd_log_bio. Function gfs2_log_flush prevents other log
flushes from interfering by taking the sd_log_flush_lock rwsem
during the flush. However, it does not protect against journal
replays. This patch allows the journal replay to take the same
sd_log_flush_lock rwsem so use of the sd_log_bio is not stomped.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2020-02-27 07:53:18 -06:00
Bob Peterson
019dd669bd gfs2: don't allow releasepage to free bd still used for revokes
Before this patch, function gfs2_releasepage would free any bd
elements that had been used for the page being released. However,
those bd elements may still be queued to the sd_log_revokes list,
in which case we cannot free them until the revoke has been issued.

This patch adds additional checks for bds that are still being
used for revokes.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2020-02-27 07:53:18 -06:00
Bob Peterson
ca399c96e9 gfs2: flesh out delayed withdraw for gfs2_log_flush
Function gfs2_log_flush() had a few places where it tried to withdraw
from the file system when errors were encountered. The problem is,
it should delay those withdraws until the log flush lock is no longer
held.

This patch creates a new function just for delayed withdraws for
situations like this. If errors=panic was specified on mount, we
still want to do it the old fashioned way because the panic it does
not help to delay in that situation.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-02-27 07:53:18 -06:00
Bob Peterson
1c634f94c3 gfs2: Do proper error checking for go_sync family of glops functions
Before this patch, function do_xmote would try to sync out the glock
dirty data by calling the appropriate glops function XXX_go_sync()
but it did not check for a good return code. If the sync was not
possible due to an io error or whatever, do_xmote would continue on
and call go_inval and release the glock to other cluster nodes.
When those nodes go to replay the journal, they may already be holding
glocks for the journal records that should have been synced, but were
not due to the ignored error.

This patch introduces proper error code checking to the go_sync
family of glops functions.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-02-27 07:53:18 -06:00
Bob Peterson
df5db5f9ee gfs2: Don't demote a glock until its revokes are written
Before this patch, run_queue would demote glocks based on whether
there are any more holders. But if the glock has pending revokes that
haven't been written to the media, giving up the glock might end in
file system corruption if the revokes never get written due to
io errors, node crashes and fences, etc. In that case, another node
will replay the metadata blocks associated with the glock, but
because the revoke was never written, it could replay that block
even though the glock had since been granted to another node who
might have made changes.

This patch changes the logic in run_queue so that it never demotes
a glock until its count of pending revokes reaches zero.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-02-27 07:53:18 -06:00
Bob Peterson
2ca0c2fbf3 gfs2: drain the ail2 list after io errors
Before this patch, gfs2_logd continually tried to flush its journal
log, after the file system is withdrawn. We don't want to write anything
to the journal, lest we add corruption. Best course of action is to
drain the ail1 into the ail2 list (via gfs2_ail1_empty) then drain the
ail2 list with a new function, ail2_drain.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-02-27 07:53:18 -06:00
Bob Peterson
b1676cbb11 gfs2: Withdraw in gfs2_ail1_flush if write_cache_pages fails
Before this patch, function gfs2_ail1_start_one would return any
errors it received from write_cache_pages (except -EBUSY) but it did
not withdraw. Since function gfs2_ail1_flush just checks for the bad
return code and loops, the loop might potentially never end.
This patch adds some logic to allow it to exit the loop and withdraw
properly when errors are received from write_cache_pages.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-02-27 07:53:18 -06:00
Bob Peterson
9ff7828935 gfs2: Do log_flush in gfs2_ail_empty_gl even if ail list is empty
Before this patch, if gfs2_ail_empty_gl saw there was nothing on
the ail list, it would return and not flush the log. The problem
is that there could still be a revoke for the rgrp sitting on the
sd_log_le_revoke list that's been recently taken off the ail list.
But that revoke still needs to be written, and the rgrp_go_inval
still needs to call log_flush_wait to ensure the revokes are all
properly written to the journal before we relinquish control of
the glock to another node. If we give the glock to another node
before we have this knowledge, the node might crash and its journal
replayed, in which case the missing revoke would allow the journal
replay to replay the rgrp over top of the rgrp we already gave to
another node, thus overwriting its changes and corrupting the
file system.

This patch makes gfs2_ail_empty_gl still call gfs2_log_flush rather
than returning.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-02-27 07:53:18 -06:00
Bob Peterson
d93ae386ef gfs2: Check for log write errors before telling dlm to unlock
Before this patch, function do_xmote just assumed all the writes
submitted to the journal were finished and successful, and it
called the go_unlock function to release the dlm lock. But if
they're not, and a revoke failed to make its way to the journal,
a journal replay on another node will cause corruption if we
let the go_inval function continue and tell dlm to release the
glock to another node. This patch adds a couple checks for errors
in do_xmote after the calls to go_sync and go_inval. If an error
is found, we cannot withdraw yet, because the withdraw itself
uses glocks to make the file system read-only. Instead, we flag
the error. Later, asserts should cause another node to replay
the journal before continuing, thus protecting rgrp and dinode
glocks and maintaining the integrity of the metadata. Note that
we only need to do this for journaled glocks. System glocks
should be able to progress even under withdrawn conditions.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-02-27 07:53:18 -06:00
Bob Peterson
f05b86db31 gfs2: Prepare to withdraw as soon as an IO error occurs in log write
Before this patch, function gfs2_end_log_write would detect any IO
errors writing to the journal and put out an appropriate message,
but it never set a withdrawing condition. Eventually, the log daemon
would see the error and determine it was time to withdraw, but in
the meantime, other processes could continue running as if nothing
bad ever happened. The biggest consequence is that __gfs2_glock_put
would BUG() when it saw that there were still unwritten items.

This patch sets the WITHDRAWING status as soon as an IO error is
detected, and that way, the BUG will be avoided so the file system
can be properly withdrawn and unmounted.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-02-27 07:53:18 -06:00
Bob Peterson
5e4c7632aa gfs2: Issue revokes more intelligently
Before this patch, function gfs2_write_revokes would call
gfs2_ail1_empty, then traverse the sd_ail1_list looking for
transactions that had bds which were no longer queued to a glock.
And if it found some, it would try to issue revokes for them, up to
a predetermined maximum. There were two problems with how it did
this. First was the fact that gfs2_ail1_empty moves transactions
which have nothing remaining on the ail1 list from the sd_ail1_list
to the sd_ail2_list, thus making its traversal of sd_ail1_list
miss them completely, and therefore, never issue revokes for them.
Second was the fact that there were three traversals (or partial
traversals) of the sd_ail1_list, each of which took and then
released the sd_ail_lock lock: First inside gfs2_ail1_empty,
second to determine if there are any revokes to be issued, and
third to actually issue them. All this taking and releasing of the
sd_ail_lock meant other processes could modify the lists and the
conditions in which we're working.

This patch simplies the whole process by adding a new parameter
to function gfs2_ail1_empty, max_revokes. For normal calls, this
is passed in as 0, meaning we don't want to issue any revokes.
For function gfs2_write_revokes, we pass in the maximum number
of revokes we can, thus allowing gfs2_ail1_empty to add the
revokes where needed. This simplies the code, allows for a single
holding of the sd_ail_lock, and allows gfs2_ail1_empty to add
revokes for all the necessary bd items without missing any.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-02-27 07:53:18 -06:00
Bob Peterson
7d9f924958 gfs2: Add verbose option to check_journal_clean
Before this patch, function check_journal_clean would give messages
related to journal recovery. That's fine for mount time, but when a
node withdraws and forces replay that way, we don't want all those
distracting and misleading messages. This patch adds a new parameter
to make those messages optional.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-02-27 07:53:17 -06:00
Bob Peterson
33dbd1e41a gfs2: fix infinite loop when checking ail item count before go_inval
Before this patch, the rgrp_go_inval and inode_go_inval functions each
checked if there were any items left on the ail count (by way of a
count), and if so, did a withdraw. But the withdraw code now uses
glocks when changing the file system to read-only status. So we can
not have glock functions withdrawing or a hang will likely result:
The glocks can't be serviced by the work_func if the work_func is
busy doing its own withdraw.

This patch removes the checks from the go_inval functions and adds
a centralized check in do_xmote to warn about the problem and not
withdraw, but flag the error so it's eventually caught when the logd
daemon eventually runs.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-02-27 07:53:17 -06:00
Bob Peterson
601ef0d52e gfs2: Force withdraw to replay journals and wait for it to finish
When a node withdraws from a file system, it often leaves its journal
in an incomplete state. This is especially true when the withdraw is
caused by io errors writing to the journal. Before this patch, a
withdraw would try to write a "shutdown" record to the journal, tell
dlm it's done with the file system, and none of the other nodes
know about the problem. Later, when the problem is fixed and the
withdrawn node is rebooted, it would then discover that its own
journal was incomplete, and replay it. However, replaying it at this
point is almost guaranteed to introduce corruption because the other
nodes are likely to have used affected resource groups that appeared
in the journal since the time of the withdraw. Replaying the journal
later will overwrite any changes made, and not through any fault of
dlm, which was instructed during the withdraw to release those
resources.

This patch makes file system withdraws seen by the entire cluster.
Withdrawing nodes dequeue their journal glock to allow recovery.

The remaining nodes check all the journals to see if they are
clean or in need of replay. They try to replay dirty journals, but
only the journals of withdrawn nodes will be "not busy" and
therefore available for replay.

Until the journal replay is complete, no i/o related glocks may be
given out, to ensure that the replay does not cause the
aforementioned corruption: We cannot allow any journal replay to
overwrite blocks associated with a glock once it is held.

The "live" glock which is now used to signal when a withdraw
occurs. When a withdraw occurs, the node signals its withdraw by
dequeueing the "live" glock and trying to enqueue it in EX mode,
thus forcing the other nodes to all see a demote request, by way
of a "1CB" (one callback) try lock. The "live" glock is not
granted in EX; the callback is only just used to indicate a
withdraw has occurred.

Note that all nodes in the cluster must wait for the recovering
node to finish replaying the withdrawing node's journal before
continuing. To this end, it checks that the journals are clean
multiple times in a retry loop.

Also note that the withdraw function may be called from a wide
variety of situations, and therefore, we need to take extra
precautions to make sure pointers are valid before using them in
many circumstances.

We also need to take care when glocks decide to withdraw, since
the withdraw code now uses glocks.

Also, before this patch, if a process encountered an error and
decided to withdraw, if another process was already withdrawing,
the second withdraw would be silently ignored, which set it free
to unlock its glocks. That's correct behavior if the original
withdrawer encounters further errors down the road. But if
secondary waiters don't wait for the journal replay, unlocking
glocks will allow other nodes to use them, despite the fact that
the journal containing those blocks is being replayed. The
replay needs to finish before our glocks are released to other
nodes. IOW, secondary withdraws need to wait for the first
withdraw to finish.

For example, if an rgrp glock is unlocked by a process that didn't
wait for the first withdraw, a journal replay could introduce file
system corruption by replaying a rgrp block that has already been
granted to a different cluster node.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2020-02-27 07:53:12 -06:00
Bob Peterson
a72d2401f5 gfs2: Allow some glocks to be used during withdraw
We need to allow some glocks to be enqueued, dequeued, promoted, and demoted
when we're withdrawn. For example, to maintain metadata integrity, we should
disallow the use of inode and rgrp glocks when withdrawn. Other glocks, like
iopen or the transaction glocks may be safely used because none of their
metadata goes through the journal. So in general, we should disallow all
glocks with an address space, and allow all the others. One exception is:
we need to allow our active journal to be demoted so others may recover it.

Allowing glocks after withdraw gives us the ability to take appropriate
action (in a following patch) to have our journal properly replayed by
another node rather than just abandoning the current transactions and
pretending nothing bad happened, leaving the other nodes free to modify
the blocks we had in our journal, which may result in file system
corruption.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2020-02-20 11:01:36 -06:00
Bob Peterson
0d91061a37 gfs2: move check_journal_clean to util.c for future use
Before this patch function check_journal_clean was in ops_fstype.c.
This patch moves it to util.c so we can make use of it elsewhere
in a future patch.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-02-10 07:39:51 -06:00
Bob Peterson
03678a99d1 gfs2: Ignore dlm recovery requests if gfs2 is withdrawn
When a node fails, user space informs dlm of the node failure,
and dlm instructs gfs2 on the surviving nodes to perform journal
recovery. It does this by calling various callback functions in
lock_dlm.c. To mark its progress, it keeps generation numbers
and recover bits in a dlm "control" lock lvb, which is seen by
all nodes to determine which journals need to be replayed.

The gfs2 on all nodes get the same recovery requests from dlm,
so they all try to do the recovery, but only one will be
granted the exclusive lock on the journal. The others fail
with a "Busy" message on their "try lock."

However, when a node is withdrawn, it cannot safely do any
recovery or replay any journals. To make matters worse,
gfs2 might withdraw as a result of attempting recovery. For
example, this might happen if the device goes offline, or if
an hba fails. But in today's gfs2 code, it doesn't check for
being withdrawn at any step in the recovery process. What's
worse is that these callbacks from dlm have no return code,
so there is no way to indicate failure back to dlm. We can
send a "Recovery failed" uevent eventually, but that tells
user space what happened, not dlm's kernel code.

Before this patch, lock_dlm would perform its recovery steps but
ignore the result, and eventually it would still update its
generation number in the lvb, despite the fact that it may have
withdrawn or encountered an error. The other nodes would then
see the newer generation number in the lvb and conclude that
they don't need to do recovery because the generation number
is newer than the last one they saw. They think a different
node has already recovered the journal.

This patch adds checks to several of the callbacks used by dlm
in its recovery state machine so that the functions are ignored
and skipped if an io error has occurred or if the file system
is withdrawn. That prevents the lvb bits from being updated, and
therefore dlm and user space still see the need for recovery to
take place.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-02-10 07:39:50 -06:00
Bob Peterson
f34a6135ce gfs2: Only complain the first time an io error occurs in quota or log
Before this patch, all io errors received by the quota daemon or the
logd daemon would cause a complaint message to be issued, such as:

   gfs2: fsid=dm-13.0: Error 10 writing to journal, jid=0

This patch changes it so that the error message is only issued the
first time the error is encountered.

Also, before this patch function gfs2_end_log_write did not set the
sd_log_error value, so log errors would not cause the file system to
be withdrawn. This patch sets the error code so the file system is
properly withdrawn if an io error is encountered writing to the journal.

WARNING: This change in function breaks check xfstests generic/441
and causes it to fail: io errors writing to the log should cause a
file system to be withdrawn, and no further operations are tolerated.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-02-10 07:39:50 -06:00
Bob Peterson
036330c914 gfs2: log error reform
Before this patch, gfs2 kept track of journal io errors in two
places sd_log_error and the SDF_AIL1_IO_ERROR flag in sd_flags.
This patch consolidates the two into sd_log_error so that it
reflects the first error encountered writing to the journal.
In future patches, we will take advantage of this by checking
this value rather than having to check both when reacting to
io errors.

In addition, this fixes a tight loop in unmount: If buffers
get on the ail1 list and an io error occurs elsewhere, the
ail1 list would never be cleared because they were always busy.
So unmount would hang, waiting for the ail1 list to empty.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-02-10 07:39:49 -06:00
Bob Peterson
b3422cacdd gfs2: Rework how rgrp buffer_heads are managed
Before this patch, the rgrp code had a serious problem related to
how it managed buffer_heads for resource groups. The problem caused
file system corruption, especially in cases of journal replay.

When an rgrp glock was demoted to transfer ownership to a
different cluster node, do_xmote() first calls rgrp_go_sync and then
rgrp_go_inval, as expected. When it calls rgrp_go_sync, that called
gfs2_rgrp_brelse() that dropped the buffer_head reference count.
In most cases, the reference count went to zero, which is right.
However, there were other places where the buffers are handled
differently.

After rgrp_go_sync, do_xmote called rgrp_go_inval which called
gfs2_rgrp_brelse a second time, then rgrp_go_inval's call to
truncate_inode_pages_range would get rid of the pages in memory,
but only if the reference count drops to 0.

Unfortunately, gfs2_rgrp_brelse was setting bi->bi_bh = NULL.
So when rgrp_go_sync called gfs2_rgrp_brelse, it lost the pointer
to the buffer_heads in cases where the reference count was still 1.
Therefore, when rgrp_go_inval called gfs2_rgrp_brelse a second time,
it failed the check for "if (bi->bi_bh)" and thus failed to call
brelse a second time. Because of that, the reference count on those
buffers sometimes failed to drop from 1 to 0. And that caused
function truncate_inode_pages_range to keep the pages in page cache
rather than freeing them.

The next time the rgrp glock was acquired, the metadata read of
the rgrp buffers re-used the pages in memory, which were now
wrong because they were likely modified by the other node who
acquired the glock in EX (which is why we demoted the glock).
This re-use of the page cache caused corruption because changes
made by the other nodes were never seen, so the bitmaps were
inaccurate.

For some reason, the problem became most apparent when journal
replay forced the replay of rgrps in memory, which caused newer
rgrp data to be overwritten by the older in-core pages.

A big part of the problem was that the rgrp buffer were released
in multiple places: The go_unlock function would release them when
the glock was released rather than when the glock is demoted,
which is clearly wrong because our intent was to cache them until
the glock is demoted from SH or EX.

This patch attempts to clean up the mess and make one consistent
and centralized mechanism for managing the rgrp buffer_heads by
implementing several changes:

1. It eliminates the call to gfs2_rgrp_brelse() from rgrp_go_sync.
   We don't want to release the buffers or zero the pointers when
   syncing for the reasons stated above. It only makes sense to
   release them when the glock is actually invalidated (go_inval).
   And when we do, then we set the bh pointers to NULL.
2. The go_unlock function (which was only used for rgrps) is
   eliminated, as we've talked about doing many times before.
   The go_unlock function was called too early in the glock dq
   process, and should not happen until the glock is invalidated.
3. It also eliminates the call to rgrp_brelse in gfs2_clear_rgrpd.
   That will now happen automatically when the rgrp glocks are
   demoted, and shouldn't happen any sooner or later than that.
   Instead, function gfs2_clear_rgrpd has been modified to demote
   the rgrp glocks, and therefore, free those pages, before the
   remaining glocks are culled by gfs2_gl_hash_clear. This
   prevents the gl_object from hanging around when the glocks are
   culled.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-02-10 07:39:48 -06:00
Bob Peterson
30fe70a85a gfs2: clear ail1 list when gfs2 withdraws
This patch fixes a bug in which function gfs2_log_flush can get into
an infinite loop when a gfs2 file system is withdrawn. The problem
is the infinite loop "for (;;)" in gfs2_log_flush which would never
finish because the io error and subsequent withdraw prevented the
items from being taken off the ail list.

This patch tries to clean up the mess by allowing withdraw situations
to move not-in-flight buffer_heads to the ail2 list, where they will
be dealt with later.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-02-10 07:39:48 -06:00
Bob Peterson
69511080bd gfs2: Introduce concept of a pending withdraw
File system withdraws can be delayed when inconsistencies are
discovered when we cannot withdraw immediately, for example, when
critical spin_locks are held. But delaying the withdraw can cause
gfs2 to ignore the error and keep running for a short period of time.
For example, an rgrp glock may be dequeued and demoted while there
are still buffers that haven't been properly revoked, due to io
errors writing to the journal.

This patch introduces a new concept of a pending withdraw, which
means an inconsistency has been discovered and we need to withdraw
at the earliest possible opportunity. In these cases, we aren't
quite withdrawn yet, but we still need to not dequeue glocks and
other critical things. If we dequeue the glocks and the withdraw
results in our journal being replayed, the replay could overwrite
data that's been modified by a different node that acquired the
glock in the meantime.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-02-10 07:39:47 -06:00
Andreas Gruenbacher
8e28ef1f2f gfs2: Return bool from gfs2_assert functions
The gfs2_assert functions only print messages when the filesystem hasn't been
withdrawn yet, and they indicate whether or not they've printed something in
their return value.  However, none of the callers use that information, so
simply return whether or not the assert has failed.

(The gfs2_assert functions are still backwards; they return false when an
assertion is true.)

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2020-02-10 07:39:47 -06:00
Andreas Gruenbacher
a5ca2f1cb6 gfs2: Turn gfs2_consist into void functions
Change the various gfs2_consist functions to return void.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2020-02-10 07:39:46 -06:00
Andreas Gruenbacher
d7e7ab3f1e gfs2: Remove usused cluster_wide arguments of gfs2_consist functions
These arguments are always passed as 0, and they are never evaluated.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2020-02-10 07:39:45 -06:00
Andreas Gruenbacher
8dc88ac68d gfs2: Report errors before withdraw
In gfs2_rgrp_verify and compute_bitstructs, make sure to report errors before
withdrawing the filesystem: otherwise, when we withdraw first and withdraw is
configured to panic, we'll never get to the error reporting.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2020-02-10 07:39:45 -06:00
Andreas Gruenbacher
badb55ec20 gfs2: Split gfs2_lm_withdraw into two functions
Split gfs2_lm_withdraw into a function that prints an error message and a
function that withdraws the filesystem.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2020-02-10 07:39:44 -06:00
Andreas Gruenbacher
6e5e41e2dc gfs2: fix O_SYNC write handling
In gfs2_file_write_iter, for direct writes, the error checking in the buffered
write fallback case is incomplete.  This can cause inode write errors to go
undetected.  Fix and clean up gfs2_file_write_iter along the way.

Based on a proposed fix by Christoph Hellwig <hch@lst.de>.

Fixes: 967bcc91b0 ("gfs2: iomap direct I/O support")
Cc: stable@vger.kernel.org # v4.19+
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-02-06 18:49:41 +01:00
Christoph Hellwig
4c0e8dda60 gfs2: move setting current->backing_dev_info
Set current->backing_dev_info just around the buffered write calls to
prepare for the next fix.

Fixes: 967bcc91b0 ("gfs2: iomap direct I/O support")
Cc: stable@vger.kernel.org # v4.19+
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-02-06 17:35:23 +01:00
Abhi Das
7582026f6f gfs2: fix gfs2_find_jhead that returns uninitialized jhead with seq 0
When the first log header in a journal happens to have a sequence
number of 0, a bug in gfs2_find_jhead() causes it to prematurely exit,
and return an uninitialized jhead with seq 0. This can cause failures
in the caller. For instance, a mount fails in one test case.

The correct behavior is for it to continue searching through the journal
to find the correct journal head with the highest sequence number.

Fixes: f4686c26ec ("gfs2: read journal in large chunks")
Cc: stable@vger.kernel.org # v5.2+
Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-02-06 17:35:23 +01:00
Linus Torvalds
a62aa6f7f5 Changes in gfs2:
- Fix some corner cases on filesystems with a block size < page size.
 - Fix a corner case that could expose incorrect access times over nfs.
 - Revert an otherwise sensible revoke accounting cleanup that causes
   assertion failures.  The revoke accounting is whacky and needs to be
   fixed properly before we can add back this cleanup.
 - Various other minor cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJeNEWRAAoJENW/n+sDE2U6EBIP/1OMCdASk1ykAzhoO3UsI8Fa
 9ktBZ7/tnmixHvPbAkPi+FLXFqfAOXRnY8tcWKdpJ5Mdesfhdm9HbSr8BtRf8l9r
 LqdQvd4F2lVemtpPIiU8MojSpmoJXs6shZvIxrLzUS5JaDVUxmoLc066otdTFya8
 FASuOzPxpCE6Rfdk+f+tBYF+UsBXY8w3w/hOiITKcLkqbCO8xnMuJzspPSs64qDN
 LtYJqJLe7THan2wEW20gtqpkZDX+WWYuBhdTWqZYs1Rfg17/ohcBYSB2kuvonLlq
 C4P/aS56U4stD+BefvI11iMnKlv6QQ+4KD1A7QWayvIPv9Lu1kvb/L7F/gamlUMM
 5LPI/3J0GSgK1fh3nDULYUXGvJqqBp2klabNAOCRA7lhZUBjU/wGTTDll6OM+K2O
 0K6HgyVp92xAyWhi/0LVRb3cA7xru5kZK+vzzdloTweCuPFMiF3IlYI4MMr3I1V1
 +/4DvpzqMjHx6t9sXb2oDjnvljedB4n3fH9YCVpl6wD/PnJ60gXm4tonrfLzkSII
 axfc5bEXaKXp43PuH9wbHsuNeOEGvEY7PG+FYbH5cV2BNGWviXYEGrNHop4gOQp9
 2bDZQJGEYvrvnjGe82ne7gmtDwRjp3+ovx/EBNp72G4mMb4igDnAAuSd0uzumatM
 9zGAa5+le4ZlpjNrJbWO
 =oo06
 -----END PGP SIGNATURE-----

Merge tag 'gfs2-for-5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Pull gfs2 updates from Andreas Gruenbacher:

 - Fix some corner cases on filesystems with a block size < page size.

 - Fix a corner case that could expose incorrect access times over nfs.

 - Revert an otherwise sensible revoke accounting cleanup that causes
   assertion failures. The revoke accounting is whacky and needs to be
   fixed properly before we can add back this cleanup.

 - Various other minor cleanups.

In addition, please expect to see another pull request from Bob Peterson
about his gfs2 recovery patch queue shortly.

* tag 'gfs2-for-5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  Revert "gfs2: eliminate tr_num_revoke_rm"
  gfs2: remove unused LBIT macros
  fs/gfs2: remove unused IS_DINODE and IS_LEAF macros
  gfs2: Remove GFS2_MIN_LVB_SIZE define
  gfs2: Fix incorrect variable name
  gfs2: Avoid access time thrashing in gfs2_inode_lookup
  gfs2: minor cleanup: remove unneeded variable ret in gfs2_jdata_writepage
  gfs2: eliminate ssize parameter from gfs2_struct2blk
  gfs2: Another gfs2_find_jhead fix
2020-01-31 13:07:16 -08:00
Linus Torvalds
677b60dcb6 New code for 5.6:
- Fix an off-by-one error when checking if offset is within inode size
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl4vLo4ACgkQ+H93GTRK
 tOtFNg/+I0IVdJgzMPe6PoEhbMc80c1CK7yxjprtq6I1WV5/OVbpNSnXqym8vYPs
 atp+8loh9LMMYhi8ln0DyU0clu8p97+7D/aUqkN4DzKJ3t0T/LFR8NPHh7P7H3i/
 rI0Ym8pTorueC3qgcVzgP2Xx8+yoURvlRV0VSd8C9Vza9E3g7U+8cAAzZz/iMbpy
 QbW32mrEWWtACLqVEj94+sNj9O4wHy1kefkBITWBkdmaWvqP6KBWJsdQrySHNHN4
 BXEwOrfxM5w8isd9hOd3U6NPkY0I7EGq7K4aQhRzyoIw9LJ+APjpCvY1KpmFs1Ar
 7yOM026MdZKGguL1cZcgXzFpIKbkv4xj4231RD6yKXUk1XhMIs+7JO+ontRekPg3
 2oZgPJoz+5QXtcpJifKzwzwr32WFssX/HhYFlzbbq1SRlJBWEyKfPBXY8yoeAJJb
 lHK/fjsPwGeTgMG0/n3+r/GhO+3FDzDosFYHpD0IZIX08770tam7nB4hlGrJEel1
 74/cxxphNeU1PkjV43V4vxTNucEkFNn+CDD9/9MvEx8ZXtz01HkIbEsANdyZM7yj
 SO0wjkHQjkm9YlBP+lh03ign4YQYYVQLN2zdw3VzK2QYo0GKln+o/wWrXhJksZ8t
 AT70yqYx2lntnkasvMPO02pgZulorM605mgUGkbwslcmuifvpo0=
 =iPhi
 -----END PGP SIGNATURE-----

Merge tag 'iomap-5.6-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull iomap fix from Darrick Wong:
 "A single patch fixing an off-by-one error when we're checking to see
  how far we're gotten into an EOF page"

* tag 'iomap-5.6-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  fs: Fix page_mkwrite off-by-one errors
2020-01-31 12:58:12 -08:00
Linus Torvalds
7eec11d3a7 Merge branch 'akpm' (patches from Andrew)
Pull updates from Andrew Morton:
 "Most of -mm and quite a number of other subsystems: hotfixes, scripts,
  ocfs2, misc, lib, binfmt, init, reiserfs, exec, dma-mapping, kcov.

  MM is fairly quiet this time.  Holidays, I assume"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (118 commits)
  kcov: ignore fault-inject and stacktrace
  include/linux/io-mapping.h-mapping: use PHYS_PFN() macro in io_mapping_map_atomic_wc()
  execve: warn if process starts with executable stack
  reiserfs: prevent NULL pointer dereference in reiserfs_insert_item()
  init/main.c: fix misleading "This architecture does not have kernel memory protection" message
  init/main.c: fix quoted value handling in unknown_bootoption
  init/main.c: remove unnecessary repair_env_string in do_initcall_level
  init/main.c: log arguments and environment passed to init
  fs/binfmt_elf.c: coredump: allow process with empty address space to coredump
  fs/binfmt_elf.c: coredump: delete duplicated overflow check
  fs/binfmt_elf.c: coredump: allocate core ELF header on stack
  fs/binfmt_elf.c: make BAD_ADDR() unlikely
  fs/binfmt_elf.c: better codegen around current->mm
  fs/binfmt_elf.c: don't copy ELF header around
  fs/binfmt_elf.c: fix ->start_code calculation
  fs/binfmt_elf.c: smaller code generation around auxv vector fill
  lib/find_bit.c: uninline helper _find_next_bit()
  lib/find_bit.c: join _find_next_bit{_le}
  uapi: rename ext2_swab() to swab() and share globally in swab.h
  lib/scatterlist.c: adjust indentation in __sg_alloc_table
  ...
2020-01-31 12:16:36 -08:00
Alexey Dobriyan
47a2ebb7f5 execve: warn if process starts with executable stack
There were few episodes of silent downgrade to an executable stack over
years:

1) linking innocent looking assembly file will silently add executable
   stack if proper linker options is not given as well:

	$ cat f.S
	.intel_syntax noprefix
	.text
	.globl f
	f:
	        ret

	$ cat main.c
	void f(void);
	int main(void)
	{
	        f();
	        return 0;
	}

	$ gcc main.c f.S
	$ readelf -l ./a.out
	  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                         0x0000000000000000 0x0000000000000000  RWE    0x10
			 					 ^^^

2) converting C99 nested function into a closure
   https://nullprogram.com/blog/2019/11/15/

	void intsort2(int *base, size_t nmemb, _Bool invert)
	{
	    int cmp(const void *a, const void *b)
	    {
	        int r = *(int *)a - *(int *)b;
	        return invert ? -r : r;
	    }
	    qsort(base, nmemb, sizeof(*base), cmp);
	}

will silently require stack trampolines while non-closure version will
not.

Without doubt this behaviour is documented somewhere, add a warning so
that developers and users can at least notice.  After so many years of
x86_64 having proper executable stack support it should not cause too
many problems.

Link: http://lkml.kernel.org/r/20191208171918.GC19716@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Will Deacon <will@kernel.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31 10:30:41 -08:00
Yunfeng Ye
aacee5446a reiserfs: prevent NULL pointer dereference in reiserfs_insert_item()
The variable inode may be NULL in reiserfs_insert_item(), but there is
no check before accessing the member of inode.

Fix this by adding NULL pointer check before calling reiserfs_debug().

Link: http://lkml.kernel.org/r/79c5135d-ff25-1cc9-4e99-9f572b88cc00@huawei.com
Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com>
Cc: zhengbin <zhengbin13@huawei.com>
Cc: Hu Shiyuan <hushiyuan@huawei.com>
Cc: Feilong Lin <linfeilong@huawei.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31 10:30:41 -08:00
Alexey Dobriyan
1fbede6e6f fs/binfmt_elf.c: coredump: allow process with empty address space to coredump
Unmapping whole address space at once with

	munmap(0, (1ULL<<47) - 4096)

or equivalent will create empty coredump.

It is silly way to exit, however registers content may still be useful.

The right to coredump is fundamental right of a process!

Link: http://lkml.kernel.org/r/20191222150137.GA1277@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31 10:30:41 -08:00
Alexey Dobriyan
28f46656ad fs/binfmt_elf.c: coredump: delete duplicated overflow check
array_size() macro will do overflow check anyway.

Link: http://lkml.kernel.org/r/20191222144009.GB24341@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31 10:30:41 -08:00
Alexey Dobriyan
225a3f53e7 fs/binfmt_elf.c: coredump: allocate core ELF header on stack
Comment says ELF header is "too large to be on stack".  64 bytes on
64-bit is not large by any means.

Link: http://lkml.kernel.org/r/20191222143850.GA24341@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31 10:30:41 -08:00
Alexey Dobriyan
18676ffcee fs/binfmt_elf.c: make BAD_ADDR() unlikely
If some mapping goes past TASK_SIZE it will be rejected by kernel which
means no such userspace binaries exist.

Mark every such check as unlikely.

Link: http://lkml.kernel.org/r/20191215124355.GA21124@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31 10:30:41 -08:00
Alexey Dobriyan
03c6d723ee fs/binfmt_elf.c: better codegen around current->mm
"current->mm" pointer is stable in general except few cases one of which
execve(2).  Compiler can't treat is as stable but it _is_ stable most of
the time.  During ELF loading process ->mm becomes stable right after
flush_old_exec().

Help compiler by caching current->mm, otherwise it continues to refetch
it.

	add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-141 (-141)
	Function                                     old     new   delta
	elf_core_dump                               5062    5039     -23
	load_elf_binary                             5426    5308    -118

Note: other cases are left as is because it is either pessimisation or
no change in binary size.

Link: http://lkml.kernel.org/r/20191215124755.GB21124@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31 10:30:41 -08:00
Alexey Dobriyan
a62c5b1b66 fs/binfmt_elf.c: don't copy ELF header around
ELF header is read into bprm->buf[] by generic execve code.

Save a memcpy and allocate just one header for the interpreter instead
of two headers (64 bytes instead of 128 on 64-bit).

Link: http://lkml.kernel.org/r/20191208171242.GA19716@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31 10:30:41 -08:00
Alexey Dobriyan
f67ef44629 fs/binfmt_elf.c: fix ->start_code calculation
Only executable segments should be accounted to ->start_code just like
they do to ->end_code (correctly).

Link: http://lkml.kernel.org/r/20191208171410.GB19716@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31 10:30:41 -08:00