xfs: don't take a spinlock unconditionally in the DIO fastpath

Because this happens at high thread counts on high IOPS devices
doing mixed read/write AIO-DIO to a single file at about a million
iops:

   64.09%     0.21%  [kernel]            [k] io_submit_one
   - 63.87% io_submit_one
      - 44.33% aio_write
         - 42.70% xfs_file_write_iter
            - 41.32% xfs_file_dio_write_aligned
               - 25.51% xfs_file_write_checks
                  - 21.60% _raw_spin_lock
                     - 21.59% do_raw_spin_lock
                        - 19.70% __pv_queued_spin_lock_slowpath

This also happens of the IO completion IO path:

   22.89%     0.69%  [kernel]            [k] xfs_dio_write_end_io
   - 22.49% xfs_dio_write_end_io
      - 21.79% _raw_spin_lock
         - 20.97% do_raw_spin_lock
            - 20.10% __pv_queued_spin_lock_slowpath

IOWs, fio is burning ~14 whole CPUs on this spin lock.

So, do an unlocked check against inode size first, then if we are
at/beyond EOF, take the spinlock and recheck. This makes the
spinlock disappear from the overwrite fastpath.

I'd like to report that fixing this makes things go faster. It
doesn't - it just exposes the the XFS_ILOCK as the next severe
contention point doing extent mapping lookups, and that now burns
all the 14 CPUs this spinlock was burning.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
This commit is contained in:
Dave Chinner 2021-06-02 15:00:38 -07:00 committed by Darrick J. Wong
parent 5a981e4ea8
commit 977ec4ddf0

View file

@ -384,21 +384,30 @@ xfs_file_write_checks(
} }
goto restart; goto restart;
} }
/* /*
* If the offset is beyond the size of the file, we need to zero any * If the offset is beyond the size of the file, we need to zero any
* blocks that fall between the existing EOF and the start of this * blocks that fall between the existing EOF and the start of this
* write. If zeroing is needed and we are currently holding the * write. If zeroing is needed and we are currently holding the iolock
* iolock shared, we need to update it to exclusive which implies * shared, we need to update it to exclusive which implies having to
* having to redo all checks before. * redo all checks before.
* *
* We need to serialise against EOF updates that occur in IO * We need to serialise against EOF updates that occur in IO completions
* completions here. We want to make sure that nobody is changing the * here. We want to make sure that nobody is changing the size while we
* size while we do this check until we have placed an IO barrier (i.e. * do this check until we have placed an IO barrier (i.e. hold the
* hold the XFS_IOLOCK_EXCL) that prevents new IO from being dispatched. * XFS_IOLOCK_EXCL) that prevents new IO from being dispatched. The
* The spinlock effectively forms a memory barrier once we have the * spinlock effectively forms a memory barrier once we have the
* XFS_IOLOCK_EXCL so we are guaranteed to see the latest EOF value * XFS_IOLOCK_EXCL so we are guaranteed to see the latest EOF value and
* and hence be able to correctly determine if we need to run zeroing. * hence be able to correctly determine if we need to run zeroing.
*
* We can do an unlocked check here safely as IO completion can only
* extend EOF. Truncate is locked out at this point, so the EOF can
* not move backwards, only forwards. Hence we only need to take the
* slow path and spin locks when we are at or beyond the current EOF.
*/ */
if (iocb->ki_pos <= i_size_read(inode))
goto out;
spin_lock(&ip->i_flags_lock); spin_lock(&ip->i_flags_lock);
isize = i_size_read(inode); isize = i_size_read(inode);
if (iocb->ki_pos > isize) { if (iocb->ki_pos > isize) {
@ -426,7 +435,7 @@ xfs_file_write_checks(
drained_dio = true; drained_dio = true;
goto restart; goto restart;
} }
trace_xfs_zero_eof(ip, isize, iocb->ki_pos - isize); trace_xfs_zero_eof(ip, isize, iocb->ki_pos - isize);
error = iomap_zero_range(inode, isize, iocb->ki_pos - isize, error = iomap_zero_range(inode, isize, iocb->ki_pos - isize,
NULL, &xfs_buffered_write_iomap_ops); NULL, &xfs_buffered_write_iomap_ops);
@ -435,6 +444,7 @@ xfs_file_write_checks(
} else } else
spin_unlock(&ip->i_flags_lock); spin_unlock(&ip->i_flags_lock);
out:
return file_modified(file); return file_modified(file);
} }
@ -500,7 +510,17 @@ xfs_dio_write_end_io(
* other IO completions here to update the EOF. Failing to serialise * other IO completions here to update the EOF. Failing to serialise
* here can result in EOF moving backwards and Bad Things Happen when * here can result in EOF moving backwards and Bad Things Happen when
* that occurs. * that occurs.
*
* As IO completion only ever extends EOF, we can do an unlocked check
* here to avoid taking the spinlock. If we land within the current EOF,
* then we do not need to do an extending update at all, and we don't
* need to take the lock to check this. If we race with an update moving
* EOF, then we'll either still be beyond EOF and need to take the lock,
* or we'll be within EOF and we don't need to take it at all.
*/ */
if (offset + size <= i_size_read(inode))
goto out;
spin_lock(&ip->i_flags_lock); spin_lock(&ip->i_flags_lock);
if (offset + size > i_size_read(inode)) { if (offset + size > i_size_read(inode)) {
i_size_write(inode, offset + size); i_size_write(inode, offset + size);