This patch fixes a deadlock, as follows:
Node 1 Node 2 Node 3
1)volume a and b are only mount vol a only mount vol b
mounted
2) start to mount b start to mount a
3) check hb of Node 3 check hb of Node 2
in vol a, qs_holds++ in vol b, qs_holds++
4) -------------------- all nodes' network down --------------------
5) progress of mount b the same situation as
failed, and then call Node 2
ocfs2_dismount_volume.
but the process is hung,
since there is a work
in ocfs2_wq cannot beo
completed. This work is
about vol a, because
ocfs2_wq is global wq.
BTW, this work which is
scheduled in ocfs2_wq is
ocfs2_orphan_scan_work,
and the context in this work
needs to take inode lock
of orphan_dir, because
lockres owner are Node 1 and
all nodes' nework has been down
at the same time, so it can't
get the inode lock.
6) Why can't this node be fenced
when network disconnected?
Because the process of
mount is hung what caused qs_holds
is not equal 0.
Because all works in the ocfs2_wq are relative to the super block.
The solution is to change the ocfs2_wq from global to local. In other
words, move it into struct ocfs2_super.
Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Xue jiufei <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
inode_foo(inode) being mutex_foo(&inode->i_mutex).
Please, use those for access to ->i_mutex; over the coming cycle
->i_mutex will become rwsem, with ->lookup() done with it held
only shared.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Since iput will take care the NULL check itself, NULL check before
calling it is redundant. So clean them up.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
These uses sometimes do and sometimes don't have '\n' terminations. Make
the uses consistently use '\n' terminations and remove the newline from
the functions.
Miscellanea:
o Coalesce formats
o Realign arguments
Signed-off-by: Joe Perches <joe@perches.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In ocfs2_local_alloc_find_clear_bits and ocfs2_get_dentry, variable
numfound and set may be uninitialized and then used in tracepoint. In
ocfs2_xattr_block_get and ocfs2_delete_xattr_in_bucket, variable block_off
and xv may be uninitialized and then used in the following logic due to
unchecked return value.
This patch fixes these possible issues.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In ocfs2_reserve_local_alloc_bits, it calls ocfs2_error if local alloc
inode bitmap used bits mismatch, but the log mistakes it as free bits.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Even if using the same jbd2 handle, we cannot rollback a transaction.
So once some error occurs after successfully allocating clusters, the
allocated clusters will never be used and it means they are lost. For
example, call ocfs2_claim_clusters successfully when expanding a file,
but failed in ocfs2_insert_extent. So we need free the allocated
clusters if they are not used indeed.
Signed-off-by: Zongxun Wang <wangzongxun@huawei.com>
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Acked-by: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Li Zefan <lizefan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is an issue in reserving and claiming space for localalloc, When
localalloc space is not enough, it would claim space from global_bitmap.
And if there is not enough free space in global_bitmap, the size of
claiming space would set to half of orignal size and retry.
The issue is as follows: osb->local_alloc_bits is set to half of orignal
size in ocfs2_recalc_la_window(), but ac->ac_bits_wanted is set to
osb->local_alloc_default_bits which is not changed. localalloc always
reserves and claims local_alloc_default_bits space and returns ENOSPC.
So, ac->ac_bits_wanted should be osb->local_alloc_bits which would be
changed.
Signed-off-by: Younger Liu <younger.liu@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Jeff Liu <jeff.liu@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use memweight to count the total number of bits set in memory area.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since all 4 files, localalloc.c, suballoc.c, alloc.c and
resize.c, which use DISK_ALLOC are changed to trace events,
Remove masklog DISK_ALLOC totally.
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
This is the 2nd step to remove the debug info of DISK_ALLOC.
So this patch removes all mlog(0,...) from localalloc.c and adds
the corresponding tracepoints. Different mlogs have different
solutions.
1. Some are replaced with trace event directly.
2. Some are replaced while some new parameters are added.
3. Some are combined into one trace events.
4. Some redundant mlogs are removed.
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
mlog_exit is used to record the exit status of a function.
But because it is added in so many functions, if we enable it,
the system logs get filled up quickly and cause too much I/O.
So actually no one can open it for a production system or even
for a test.
This patch just try to remove it or change it. So:
1. if all the error paths already use mlog_errno, it is just removed.
Otherwise, it will be replaced by mlog_errno.
2. if it is used to print some return value, it is replaced with
mlog(0,...).
mlog_exit_ptr is changed to mlog(0.
All those mlog(0,...) will be replaced with trace events later.
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
ENTRY is used to record the entry of a function.
But because it is added in so many functions, if we enable it,
the system logs get filled up quickly and cause too much I/O.
So actually no one can open it for a production system or even
for a test.
So for mlog_entry_void, we just remove it.
for mlog_entry(...), we replace it with mlog(0,...), and they
will be replace by trace event later.
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
In commit 6b82021b9e, we increase
our local alloc size and calculate how much megabytes we can
get according to group size and volume size.
But we also need to check the maximum bits a local alloc block
bitmap can have. With a bs=512, cs=32K, local volume with 160G,
it calculate 96MB while the maximum local alloc size is only
76M. So the bitmap will overflow and corrupt the system truncate
log file. See bug
http://oss.oracle.com/bugzilla/show_bug.cgi?id=1262
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
ocfs2_group_bitmap_size has to handle the case when the
volume don't have discontiguous block group support. So
pass the feature_incompat in and check it.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
They all take an ocfs2_alloc_context, which has the allocation inode.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Defines the OCFS2_FEATURE_INCOMPAT_DISCONTIG_BG feature bit and modifies
struct ocfs2_group_desc for the feature.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Two tiny cleanup for allocation reservation.
1. Remove some extra codes in ocfs2_local_alloc_find_clear_bits.
2. Remove an unuseful variables in ocfs2_find_resv_lhs.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
I have observed that the current size of 8M gives us pretty poor
fragmentation on multi-threaded workloads which do lots of writes.
Generally, I can increase the size of local alloc windows and observe a
marked decrease in fragmentation, even up and beyond window sizes of 512
megabytes. This makes sense for a couple reasons - larger local alloc means
more room for reservation windows. On multi-node workloads the larger local
alloc helps as well because we don't have to do window slides as often.
Also, I removed the OCFS2_DEFAULT_LOCAL_ALLOC_SIZE constant as it is no
longer used and the comment above it was out of date.
To test fragmentation, I used a workload which launched 4 threads that did
4k writes into a series of about 140 alternating files.
With resv_level=2, and a 4k/4k file system I observed the following average
fragmentation for various localalloc= parameters:
localalloc= avg. fragmentation
8 48
32 16
64 10
120 7
On larger cluster sizes, the difference is more dramatic.
The new default size top out at 256M, which we'll only get for cluster
sizes of 32K and above.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
This patch pulls the local alloc sizing code into localalloc.c and provides
a callout to it from ocfs2_fill_super(). Behavior is essentially unchanged
except that I correctly calculate the maximum local alloc size. The old code
in ocfs2_parse_options() calculated the max size as:
ocfs2_local_alloc_size(sb) * 8
which is correct, in bits. Unfortunately though the option passed in is in
megabytes. Ultimately, this bug made no real difference - the shrink code
would catch a too-large size and bring it down to something reasonable.
Still, it's less than efficient as-is.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Inodes are always allocated from the global bitmap now so we don't need this
any more. Also, the existing implementation bounces reservations around
needlessly.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch improves Ocfs2 allocation policy by allowing an inode to
reserve a portion of the local alloc bitmap for itself. The reserved
portion (allocation window) is advisory in that other allocation
windows might steal it if the local alloc bitmap becomes
full. Otherwise, the reservations are honored and guaranteed to be
free. When the local alloc window is moved to a different portion of
the bitmap, existing reservations are discarded.
Reservation windows are represented internally by a red-black
tree. Within that tree, each node represents the reservation window of
one inode. An LRU of active reservations is also maintained. When new
data is written, we allocate it from the inodes window. When all bits
in a window are exhausted, we allocate a new one as close to the
previous one as possible. Should we not find free space, an existing
reservation is pulled off the LRU and cannibalized.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
jbd[2]_journal_dirty_metadata() only returns 0. It's been returning 0
since before the kernel moved to git. There is no point in checking
this error.
ocfs2_journal_dirty() has been faithfully returning the status since the
beginning. All over ocfs2, we have blocks of code checking this can't
fail status. In the past few years, we've tried to avoid adding these
checks, because they are pointless. But anyone who looks at our code
assumes they are needed.
Finally, ocfs2_journal_dirty() is made a void function. All error
checking is removed from other files. We'll BUG_ON() the status of
jbd2_journal_dirty_metadata() just in case they change it someday. They
won't.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
When the local alloc file changes windows, unused bits are freed back to the
global bitmap. By defnition, those bits can not be in use by any file. Also,
the local alloc will never have been able to allocate those bits if they
were part of a previous truncate. Therefore it makes sense that we should
clear unused local alloc bits in the undo buffer so that they can be used
immediatly.
[ Modified to call it ocfs2_release_clusters() -- Joel ]
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
What we were doing before was to ask for the current window size as the
maximum allocation. This had the effect of limiting the amount of allocation
we could get for the local alloc during times when the window size was
shrunk due to fragmentation. In some cases, that could actually *increase*
fragmentation by artificially limiting the number of bits we can accept. So
while we still want to ask for a minimum number of bits equal to window
size, there is no reason why we should limit the number of bits the local
alloc should accept. Hence always allow the maximum number of local alloc
bits.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
This patch add extent block (metadata) stealing mechanism for
extent allocation. This mechanism is same as the inode stealing.
if no room in slot specific extent_alloc, we will try to
allocate extent block from the next slot.
Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Acked-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
The next step in divorcing metadata I/O management from struct inode is
to pass struct ocfs2_caching_info to the journal functions. Thus the
journal locks a metadata cache with the cache io_lock function. It also
can compare ci_last_trans and ci_created_trans directly.
This is a large patch because of all the places we change
ocfs2_journal_access..(handle, inode, ...) to
ocfs2_journal_access..(handle, INODE_CACHE(inode), ...).
Signed-off-by: Joel Becker <joel.becker@oracle.com>
We are really passing the inode into the ocfs2_read/write_blocks()
functions to get at the metadata cache. This commit passes the cache
directly into the metadata block functions, divorcing them from the
inode.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
This patch removes the debugfs file local_alloc_stats as that information
is now included in the fs_state debugfs file.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The per-metadata-type ocfs2_journal_access_*() functions hook up jbd2
commit triggers and allow us to compute metadata ecc right before the
buffers are written out. This commit provides ecc for inodes, extent
blocks, group descriptors, and quota blocks. It is not safe to use
extened attributes and metaecc at the same time yet.
The ocfs2_extent_tree and ocfs2_path abstractions in alloc.c both hide
the type of block at their root. Before, it didn't matter, but now the
root block must use the appropriate ocfs2_journal_access_*() function.
To keep this abstract, the structures now have a pointer to the matching
journal_access function and a wrapper call to call it.
A few places use naked ocfs2_write_block() calls instead of adding the
blocks to the journal. We make sure to calculate their checksum and ecc
before the write.
Since we pass around the journal_access functions. Let's typedef them
in ocfs2.h.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The ocfs2 code currently reads inodes off disk with a simple
ocfs2_read_block() call. Each place that does this has a different set
of sanity checks it performs. Some check only the signature. A couple
validate the block number (the block read vs di->i_blkno). A couple
others check for VALID_FL. Only one place validates i_fs_generation. A
couple check nothing. Even when an error is found, they don't all do
the same thing.
We wrap inode reading into ocfs2_read_inode_block(). This will validate
all the above fields, going readonly if they are invalid (they never
should be). ocfs2_read_inode_block_full() is provided for the places
that want to pass read_block flags. Every caller is passing a struct
inode with a valid ip_blkno, so we don't need a separate blkno argument
either.
We will remove the validation checks from the rest of the code in a
later commit, as they are no longer necessary.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
ocfs2_read_blocks() currently requires the CACHED flag for cached I/O.
However, that's the common case. Let's flip it around and provide an
IGNORE_CACHE flag for the special users. This has the added benefit of
cleaning up the code some (ignore_cache takes on its special meaning
earlier in the loop).
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
More than 30 callers of ocfs2_read_block() pass exactly OCFS2_BH_CACHED.
Only six pass a different flag set. Rather than have every caller care,
let's make ocfs2_read_block() take no flags and always do a cached read.
The remaining six places can call ocfs2_read_blocks() directly.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Now that synchronous readers are using ocfs2_read_blocks_sync(), all
callers of ocfs2_read_blocks() are passing an inode. Use it
unconditionally. Since it's there, we don't need to pass the
ocfs2_super either.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
ocfs2 inode numbers are block numbers. For any filesystem with less
than 2^32 blocks, this is not a problem. However, when ocfs2 starts
using JDB2, it will be able to support filesystems with more than 2^32
blocks. This would result in inode numbers higher than 2^32.
The problem is that stat(2) can't handle those numbers on 32bit
machines. The simple solution is to have ocfs2 allocate all inodes
below that boundary.
The suballoc code is changed to honor an optional block limit. Only the
inode suballocator sets that limit - all other allocations stay unlimited.
The biggest trick is to grow the inode suballocator beneath that limit.
There's no point in allocating block groups that are above the limit,
then rejecting their elements later on. We want to prevent the inode
allocator from ever having block groups above the limit. This involves
a little gyration with the local alloc code. If the local alloc window
is above the limit, it signals the caller to try the global bitmap but
does not disable the local alloc file (which can be used for other
allocations).
[ Minor cleanup - removed an ML_NOTICE comment. --Mark ]
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
A per-mount debugfs file, "local_alloc" is created which when read will
expose live state of the nodes local alloc file. Performance impact is
minimal, only a bit of memory overhead per mount point. Still, the code is
hidden behind CONFIG_OCFS2_FS_STATS. This feature will help us debug
local alloc performance problems on a live system.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Ocfs2's local allocator disables itself for the duration of a mount point
when it has trouble allocating a large enough area from the primary bitmap.
That can cause performance problems, especially for disks which were only
temporarily full or fragmented. This patch allows for the allocator to
shrink it's window first, before being disabled. Later, it can also be
re-enabled so that any performance drop is minimized.
To do this, we allow the value of osb->local_alloc_bits to be shrunk when
needed. The default value is recorded in a mostly read-only variable so that
we can re-initialize when required.
Locking had to be updated so that we could protect changes to
local_alloc_bits. Mostly this involves protecting various local alloc values
with the osb spinlock. A new state is also added, OCFS2_LA_THROTTLED, which
is used when the local allocator is has shrunk, but is not disabled. If the
available space dips below 1 megabyte, the local alloc file is disabled. In
either case, local alloc is re-enabled 30 seconds after the event, or when
an appropriate amount of bits is seen in the primary bitmap.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Do this instead of tracking absolute local alloc size. This avoids
needless re-calculatiion of bits from bytes in localalloc.c. Additionally,
the value is now in a more natural unit for internal file system bitmap
work.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
A couple places use OCFS2_DEBUG_FS where they really mean
CONFIG_OCFS2_DEBUG_FS.
Reported-by: Robert P. J. Day <rpjday@crashcourse.ca>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
kmalloc() during a localalloc window move can trigger the mm to prune
the dcache which inturn can trigger the fs to delete an inode causing
it start a recursive transaction.
The fix also makes the change in kmalloc during localalloc shutdown
just to be safe.
Fixes oss bugzilla#901
http://oss.oracle.com/bugzilla/show_bug.cgi?id=901
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Inode allocation is modified to look in other nodes allocators during
extreme out of space situations. We retry our own slot when space is freed
back to the global bitmap, or whenever we've allocated more than 1024 inodes
from another slot.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
In inode stealing, we no longer restrict the allocation to
happen in the local node. So it is neccessary for us to add
a new member in ocfs2_alloc_context to indicate which slot
we are using for allocation. We also modify the process of
local alloc so that this member can be used there also.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Commit 2fbe8d1ebe disabled localalloc
for local mounts. This caused issues as ocfs2 uses localalloc to
provide write locality. This patch enables localalloc for local mounts.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Local alloc is a performance optimization in ocfs2 in which a node
takes a window of bits from the global bitmap and then uses that for
all small local allocations. This window size is fixed to 8MB currently.
This patch allows users to specify the window size in MB including
disabling it by passing in 0. If the number specified is too large,
the fs will use the default value of 8MB.
mount -o localalloc=X /dev/sdX /mntpoint
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Call this the "inode_lock" now, since it covers both data and meta data.
This patch makes no functional changes.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Enable expensive bitmap scanning only if DEBUG option is enabled.
The bitmap scanning quite loads the CPU and on my machine the write
throughput of dd if=/dev/zero of=/ocfs2/file bs=1M count=500 conv=sync
improves from 37 MB/s to 45.4 MB/s in local mode...
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>