Commit graph

950925 commits

Author SHA1 Message Date
Harshad Shirwadkar
5b849b5f96 jbd2: fast commit recovery path
This patch adds fast commit recovery support in JBD2.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20201015203802.3597742-7-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-21 23:22:37 -04:00
Harshad Shirwadkar
aa75f4d3da ext4: main fast-commit commit path
This patch adds main fast commit commit path handlers. The overall
patch can be divided into two inter-related parts:

(A) Metadata updates tracking

    This part consists of helper functions to track changes that need
    to be committed during a commit operation. These updates are
    maintained by Ext4 in different in-memory queues. Following are
    the APIs and their short description that are implemented in this
    patch:

    - ext4_fc_track_link/unlink/creat() - Track unlink. link and creat
      operations
    - ext4_fc_track_range() - Track changed logical block offsets
      inodes
    - ext4_fc_track_inode() - Track inodes
    - ext4_fc_mark_ineligible() - Mark file system fast commit
      ineligible()
    - ext4_fc_start_update() / ext4_fc_stop_update() /
      ext4_fc_start_ineligible() / ext4_fc_stop_ineligible() These
      functions are useful for co-ordinating inode updates with
      commits.

(B) Main commit Path

    This part consists of functions to convert updates tracked in
    in-memory data structures into on-disk commits. Function
    ext4_fc_commit() is the main entry point to commit path.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20201015203802.3597742-6-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-21 23:22:37 -04:00
Harshad Shirwadkar
ff780b91ef jbd2: add fast commit machinery
This functions adds necessary APIs needed in JBD2 layer for fast
commits.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20201015203802.3597742-5-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-21 23:22:37 -04:00
Harshad Shirwadkar
6866d7b3f2 ext4 / jbd2: add fast commit initialization
This patch adds fast commit area trackers in the journal_t
structure. These are initialized via the jbd2_fc_init() routine that
this patch adds. This patch also adds ext4/fast_commit.c and
ext4/fast_commit.h files for fast commit code that will be added in
subsequent patches in this series.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20201015203802.3597742-4-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-21 23:22:26 -04:00
Harshad Shirwadkar
995a3ed67f ext4: add fast_commit feature and handling for extended mount options
We are running out of mount option bits. Add handling for using
s_mount_opt2. Add ext4 and jbd2 fast commit feature flag and also add
ability to turn off the fast commit feature in Ext4.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20201015203802.3597742-3-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-21 23:22:26 -04:00
Harshad Shirwadkar
f5b8b297b0 doc: update ext4 and journalling docs to include fast commit feature
This patch adds necessary documentation for fast commits.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20201015203802.3597742-2-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-21 23:19:43 -04:00
Jan Kara
e0770e9142 ext4: Detect already used quota file early
When we try to use file already used as a quota file again (for the same
or different quota type), strange things can happen. At the very least
lockdep annotations may be wrong but also inode flags may be wrongly set
/ reset. When the file is used for two quota types at once we can even
corrupt the file and likely crash the kernel. Catch all these cases by
checking whether passed file is already used as quota file and bail
early in that case.

This fixes occasional generic/219 failure due to lockdep complaint.

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20201015110330.28716-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18 10:37:26 -04:00
changfengnan
fc750a3b44 jbd2: avoid transaction reuse after reformatting
When ext4 is formatted with lazy_journal_init=1 and transactions from
the previous filesystem are still on disk, it is possible that they are
considered during a recovery after a crash. Because the checksum seed
has changed, the CRC check will fail, and the journal recovery fails
with checksum error although the journal is otherwise perfectly valid.
Fix the problem by checking commit block time stamps to determine
whether the data in the journal block is just stale or whether it is
indeed corrupt.

Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Fengnan Chang <changfengnan@hikvision.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20201012164900.20197-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18 10:37:26 -04:00
Kaixu Xia
d3e7d20bef ext4: use the normal helper to get the actual inode
Here we use the READ_ONCE to fix race conditions in ->d_compare() and
->d_hash() when they are called in RCU-walk mode, seems we can use
the normal helper d_inode_rcu() to get the actual inode.

Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Link: https://lore.kernel.org/r/1602317416-1260-1-git-send-email-kaixuxia@tencent.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18 10:37:26 -04:00
Ritesh Harjani
d1e18b8824 ext4: fix bs < ps issue reported with dioread_nolock mount opt
left shifting m_lblk by blkbits was causing value overflow and hence
it was not able to convert unwritten to written extent.
So, make sure we typecast it to loff_t before do left shift operation.
Also in func ext4_convert_unwritten_io_end_vec(), make sure to initialize
ret variable to avoid accidentally returning an uninitialized ret.

This patch fixes the issue reported in ext4 for bs < ps with
dioread_nolock mount option.

Fixes: c8cc88163f ("ext4: Add support for blocksize < pagesize in dioread_nolock")
Cc: stable@kernel.org
Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/af902b5db99e8b73980c795d84ad7bb417487e76.1602168865.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18 10:37:15 -04:00
Mauricio Faria de Oliveira
afb585a97f ext4: data=journal: write-protect pages on j_submit_inode_data_buffers()
This implements journal callbacks j_submit|finish_inode_data_buffers()
with different behavior for data=journal: to write-protect pages under
commit, preventing changes to buffers writeably mapped to userspace.

If a buffer's content changes between commit's checksum calculation
and write-out to disk, it can cause journal recovery/mount failures
upon a kernel crash or power loss.

    [   27.334874] EXT4-fs: Warning: mounting with data=journal disables delayed allocation, dioread_nolock, and O_DIRECT support!
    [   27.339492] JBD2: Invalid checksum recovering data block 8705 in log
    [   27.342716] JBD2: recovery failed
    [   27.343316] EXT4-fs (loop0): error loading journal
    mount: /ext4: can't read superblock on /dev/loop0.

In j_submit_inode_data_buffers() we write-protect the inode's pages
with write_cache_pages() and redirty w/ writepage callback if needed.

In j_finish_inode_data_buffers() there is nothing do to.

And in order to use the callbacks, inodes are added to the inode list
in transaction in __ext4_journalled_writepage() and ext4_page_mkwrite().

In ext4_page_mkwrite() we must make sure that the buffers are attached
to the transaction as jbddirty with write_end_fn(), as already done in
__ext4_journalled_writepage().

Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Reported-by: Dann Frazier <dann.frazier@canonical.com>
Reported-by: kernel test robot <lkp@intel.com> # wbc.nr_to_write
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20201006004841.600488-5-mfo@canonical.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18 10:37:15 -04:00
Mauricio Faria de Oliveira
64a9f14499 ext4: data=journal: fixes for ext4_page_mkwrite()
These are two fixes for data journalling required by
the next patch, discovered while testing it.

First, the optimization to return early if all buffers
are mapped is not appropriate for the next patch:

The inode _must_ be added to the transaction's list in
data=journal mode (so to write-protect pages on commit)
thus we cannot return early there.

Second, once that optimization to reduce transactions
was disabled for data=journal mode, more transactions
happened, and occasionally hit this warning message:
'JBD2: Spotted dirty metadata buffer'.

Reason is, block_page_mkwrite() will set_buffer_dirty()
before do_journal_get_write_access() that is there to
prevent it. This issue was masked by the optimization.

So, on data=journal use __block_write_begin() instead.
This also requires page locking and len recalculation.
(see block_page_mkwrite() for implementation details.)

Finally, as Jan noted there is little sharing between
data=journal and other modes in ext4_page_mkwrite().

However, a prototype of ext4_journalled_page_mkwrite()
showed there still would be lots of duplicated lines
(tens of) that didn't seem worth it.

Thus this patch ends up with an ugly goto to skip all
non-data journalling code (to avoid long indentations,
but that can be changed..) in the beginning, and just
a conditional in the transaction section.

Well, we skip a common part to data journalling which
is the page truncated check, but we do it again after
ext4_journal_start() when we re-acquire the page lock
(so not to acquire the page lock twice needlessly for
data journalling.)

Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20201006004841.600488-4-mfo@canonical.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18 10:37:15 -04:00
Mauricio Faria de Oliveira
342af94ec6 jbd2, ext4, ocfs2: introduce/use journal callbacks j_submit|finish_inode_data_buffers()
Introduce journal callbacks to allow different behaviors
for an inode in journal_submit|finish_inode_data_buffers().

The existing users of the current behavior (ext4, ocfs2)
are adapted to use the previously exported functions
that implement the current behavior.

Users are callers of jbd2_journal_inode_ranged_write|wait(),
which adds the inode to the transaction's inode list with
the JI_WRITE|WAIT_DATA flags. Only ext4 and ocfs2 in-tree.

Both CONFIG_EXT4_FS and CONFIG_OCSFS2_FS select CONFIG_JBD2,
which builds fs/jbd2/commit.c and journal.c that define and
export the functions, so we can call directly in ext4/ocfs2.

Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20201006004841.600488-3-mfo@canonical.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18 10:37:15 -04:00
Mauricio Faria de Oliveira
aa3c0c61f6 jbd2: introduce/export functions jbd2_journal_submit|finish_inode_data_buffers()
Export functions that implement the current behavior done
for an inode in journal_submit|finish_inode_data_buffers().

No functional change.

Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20201006004841.600488-2-mfo@canonical.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18 10:37:15 -04:00
zhangyi (F)
8394a6abf3 ext4: introduce ext4_sb_bread_unmovable() to replace sb_bread_unmovable()
Now we only use sb_bread_unmovable() to read superblock and descriptor
block at mount time, so there is no opportunity that we need to clear
buffer verified bit and also handle buffer write_io error bit. But for
the sake of unification, let's introduce ext4_sb_bread_unmovable() to
replace all sb_bread_unmovable(). After this patch, we stop using read
helpers in fs/buffer.c.

Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Link: https://lore.kernel.org/r/20200924073337.861472-8-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18 10:37:14 -04:00
zhangyi (F)
0a846f496d ext4: use ext4_sb_bread() instead of sb_bread()
We have already remove open codes that invoke helpers provide by
fs/buffer.c in all places reading metadata buffers. This patch switch to
use ext4_sb_bread() to replace all sb_bread() helpers, which is
ext4_read_bh() helper back end.

Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Link: https://lore.kernel.org/r/20200924073337.861472-7-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18 10:37:14 -04:00
zhangyi (F)
5df1d4123d ext4: introduce ext4_sb_breadahead_unmovable() to replace sb_breadahead_unmovable()
If we readahead inode tables in __ext4_get_inode_loc(), it may bypass
buffer_write_io_error() check, so introduce ext4_sb_breadahead_unmovable()
to handle this special case.

This patch also replace sb_breadahead_unmovable() in ext4_fill_super()
for the sake of unification.

Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Link: https://lore.kernel.org/r/20200924073337.861472-6-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18 10:37:14 -04:00
zhangyi (F)
60c776e50b ext4: use ext4_buffer_uptodate() in __ext4_get_inode_loc()
We have already introduced ext4_buffer_uptodate() to re-set the uptodate
bit on buffer which has been failed to write out to disk. Just remove
the redundant codes and switch to use ext4_buffer_uptodate() in
__ext4_get_inode_loc().

Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Link: https://lore.kernel.org/r/20200924073337.861472-5-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18 10:37:14 -04:00
zhangyi (F)
2d069c0889 ext4: use common helpers in all places reading metadata buffers
Revome all open codes that read metadata buffers, switch to use
ext4_read_bh_*() common helpers.

Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Suggested-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200924073337.861472-4-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18 10:37:14 -04:00
zhangyi (F)
fa491b14cd ext4: introduce new metadata buffer read helpers
The previous patch add clear_buffer_verified() before we read metadata
block from disk again, but it's rather easy to miss clearing of this bit
because currently we read metadata buffer through different open codes
(e.g. ll_rw_block(), bh_submit_read() and invoke submit_bh() directly).
So, it's time to add common helpers to unify in all the places reading
metadata buffers instead. This patch add 3 helpers:

 - ext4_read_bh_nowait(): async read metadata buffer if it's actually
   not uptodate, clear buffer_verified bit before read from disk.
 - ext4_read_bh(): sync version of read metadata buffer, it will wait
   until the read operation return and check the return status.
 - ext4_read_bh_lock(): try to lock the buffer before read buffer, it
   will skip reading if the buffer is already locked.

After this patch, we need to use these helpers in all the places reading
metadata buffer instead of different open codes.

Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Suggested-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200924073337.861472-3-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18 10:37:13 -04:00
zhangyi (F)
d9befedaaf ext4: clear buffer verified flag if read meta block from disk
The metadata buffer is no longer trusted after we read it from disk
again because it is not uptodate for some reasons (e.g. failed to write
back). Otherwise we may get below memory corruption problem in
ext4_ext_split()->memset() if we read stale data from the newly
allocated extent block on disk which has been failed to async write
out but miss verify again since the verified bit has already been set
on the buffer.

[   29.774674] BUG: unable to handle kernel paging request at ffff88841949d000
...
[   29.783317] Oops: 0002 [#2] SMP
[   29.784219] R10: 00000000000f4240 R11: 0000000000002e28 R12: ffff88842fa1c800
[   29.784627] CPU: 1 PID: 126 Comm: kworker/u4:3 Tainted: G      D W
[   29.785546] R13: ffffffff9cddcc20 R14: ffffffff9cddd420 R15: ffff88842fa1c2f8
[   29.786679] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),BIOS ?-20190727_0738364
[   29.787588] FS:  0000000000000000(0000) GS:ffff88842fa00000(0000) knlGS:0000000000000000
[   29.789288] Workqueue: writeback wb_workfn
[   29.790319] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   29.790321]  (flush-8:0)
[   29.790844] CR2: 0000000000000008 CR3: 00000004234f2000 CR4: 00000000000006f0
[   29.791924] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   29.792839] RIP: 0010:__memset+0x24/0x30
[   29.793739] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   29.794256] Code: 90 90 90 90 90 90 0f 1f 44 00 00 49 89 f9 48 89 d1 83 e2 07 48 c1 e9 033
[   29.795161] Kernel panic - not syncing: Fatal exception in interrupt
...
[   29.808149] Call Trace:
[   29.808475]  ext4_ext_insert_extent+0x102e/0x1be0
[   29.809085]  ext4_ext_map_blocks+0xa89/0x1bb0
[   29.809652]  ext4_map_blocks+0x290/0x8a0
[   29.809085]  ext4_ext_map_blocks+0xa89/0x1bb0
[   29.809652]  ext4_map_blocks+0x290/0x8a0
[   29.810161]  ext4_writepages+0xc85/0x17c0
...

Fix this by clearing buffer's verified bit if we read meta block from
disk again.

Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20200924073337.861472-2-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18 10:37:13 -04:00
Darrick J. Wong
af8c53c8bc ext4: limit entries returned when counting fsmap records
If userspace asked fsmap to try to count the number of entries, we cannot
return more than UINT_MAX entries because fmh_entries is u32.
Therefore, stop counting if we hit this limit or else we will waste time
to return truncated results.

Fixes: 0c9ec4beec ("ext4: support GETFSMAP ioctls")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Link: https://lore.kernel.org/r/20201001222148.GA49520@magnolia
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18 10:37:13 -04:00
Chunguang Xu
addd752cff ext4: make mb_check_counter per group
Make bb_check_counter per group, so each group has the same chance
to be checked, which can expose errors more easily.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Link: https://lore.kernel.org/r/1601292995-32205-2-git-send-email-brookxu@tencent.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18 10:37:13 -04:00
Chunguang Xu
9d1f9b2770 ext4: delete invalid comments near mb_buddy_adjust_border
The comment near mb_buddy_adjust_border seems meaningless, just
clear it.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Link: https://lore.kernel.org/r/1601292995-32205-1-git-send-email-brookxu@tencent.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18 10:37:13 -04:00
Zhang Xiaoxu
9704a322ea ext4: fix bdev write error check failed when mount fs with ro
Consider a situation when a filesystem was uncleanly shutdown and the
orphan list is not empty and a read-only mount is attempted. The orphan
list cleanup during mount will fail with:

ext4_check_bdev_write_error:193: comm mount: Error while async write back metadata

This happens because sbi->s_bdev_wb_err is not initialized when mounting
the filesystem in read only mode and so ext4_check_bdev_write_error()
falsely triggers.

Initialize sbi->s_bdev_wb_err unconditionally to avoid this problem.

Fixes: bc71726c72 ("ext4: abort the filesystem if failed to async write metadata buffer")
Cc: stable@kernel.org
Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200928020556.710971-1-zhangxiaoxu5@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18 10:36:59 -04:00
Chunguang Xu
dd0db94f30 ext4: rename system_blks to s_system_blks inside ext4_sb_info
Rename system_blks to s_system_blks inside ext4_sb_info, keep
the naming rules consistent with other variables, which is
convenient for code reading and writing.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/1600916623-544-2-git-send-email-brookxu@tencent.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18 10:36:59 -04:00
Chunguang Xu
ee7ed3aa0f ext4: rename journal_dev to s_journal_dev inside ext4_sb_info
Rename journal_dev to s_journal_dev inside ext4_sb_info, keep
the naming rules consistent with other variables, which is
convenient for code reading and writing.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/1600916623-544-1-git-send-email-brookxu@tencent.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18 10:36:59 -04:00
Hui Su
15a119e093 jbd2: fix the comment of struct jbd2_journal_handle
the struct name was modified long ago, but the comment still
use struct handle_s.

Signed-off-by: Hui Su <sh_def@163.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200922171231.GA53120@rlk
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18 10:36:59 -04:00
Zhang Qilong
2be7d717ca ext4: add trace exit in exception path.
Missing trace exit in exception path of ext4_sync_file and
ext4_ind_map_blocks.

Signed-off-by: Zhang Qilong <zhangqilong3@huawei.com>
Link: https://lore.kernel.org/r/20200921124738.23352-1-zhangqilong3@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18 10:36:59 -04:00
Ritesh Harjani
9faac62d40 ext4: optimize file overwrites
In case if the file already has underlying blocks/extents allocated
then we don't need to start a journal txn and can directly return
the underlying mapping. Currently ext4_iomap_begin() is used by
both DAX & DIO path. We can check if the write request is an
overwrite & then directly return the mapping information.

This could give a significant perf boost for multi-threaded writes
specially random overwrites.
On PPC64 VM with simulated pmem(DAX) device, ~10x perf improvement
could be seen in random writes (overwrite). Also bcoz this optimizes
away the spinlock contention during jbd2 slab cache allocation
(jbd2_journal_handle). On x86 VM, ~2x perf improvement was observed.

Reported-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/88e795d8a4d5cd22165c7ebe857ba91d68d8813e.1600401668.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18 10:36:58 -04:00
Tian Tao
7eb90a2d6a ext4: remove unused including <linux/version.h>
Remove including <linux/version.h> that don't need it.

Signed-off-by: Tian Tao <tiantao6@hisilicon.com>
Link: https://lore.kernel.org/r/1600397165-42873-1-git-send-email-tiantao6@hisilicon.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18 10:36:58 -04:00
Constantine Sapuntzakis
acaa532687 ext4: fix superblock checksum calculation race
The race condition could cause the persisted superblock checksum
to not match the contents of the superblock, causing the
superblock to be considered corrupt.

An example of the race follows.  A first thread is interrupted in the
middle of a checksum calculation. Then, another thread changes the
superblock, calculates a new checksum, and sets it. Then, the first
thread resumes and sets the checksum based on the older superblock.

To fix, serialize the superblock checksum calculation using the buffer
header lock. While a spinlock is sufficient, the buffer header is
already there and there is precedent for locking it (e.g. in
ext4_commit_super).

Tested the patch by booting up a kernel with the patch, creating
a filesystem and some files (including some orphans), and then
unmounting and remounting the file system.

Cc: stable@kernel.org
Signed-off-by: Constantine Sapuntzakis <costa@purestorage.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Suggested-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200914161014.22275-1-costa@purestorage.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18 10:36:23 -04:00
Dinghao Liu
c9e87161cc ext4: fix error handling code in add_new_gdb
When ext4_journal_get_write_access() fails, we should
terminate the execution flow and release n_group_desc,
iloc.bh, dind and gdb_bh.

Cc: stable@kernel.org
Signed-off-by: Dinghao Liu <dinghao.liu@zju.edu.cn>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20200829025403.3139-1-dinghao.liu@zju.edu.cn
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18 10:36:14 -04:00
Xiao Yang
aa2f77920b ext4: disallow modifying DAX inode flag if inline_data has been set
inline_data is mutually exclusive to DAX so enabling both of them triggers
the following issue:
------------------------------------------
# mkfs.ext4 -F -O inline_data /dev/pmem1
...
# mount /dev/pmem1 /mnt
# echo 'test' >/mnt/file
# lsattr -l /mnt/file
/mnt/file                    Inline_Data
# xfs_io -c "chattr +x" /mnt/file
# xfs_io -c "lsattr -v" /mnt/file
[dax] /mnt/file
# umount /mnt
# mount /dev/pmem1 /mnt
# cat /mnt/file
cat: /mnt/file: Numerical result out of range
------------------------------------------

Fixes: b383a73f2b ("fs/ext4: Introduce DAX inode flag")
Signed-off-by: Xiao Yang <yangx.jy@cn.fujitsu.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20200828084330.15776-1-yangx.jy@cn.fujitsu.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18 10:36:13 -04:00
Nikolay Borisov
15ed2851b0 ext4: remove unused argument from ext4_(inc|dec)_count
The 'handle' argument is not used for anything so simply remove it.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20200826133116.11592-1-nborisov@suse.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18 10:36:13 -04:00
Petr Malat
81e8c3c503 ext4: do not interpret high bytes if 64bit feature is disabled
Fields s_free_blocks_count_hi, s_r_blocks_count_hi and s_blocks_count_hi
are not valid if EXT4_FEATURE_INCOMPAT_64BIT is not enabled and should be
treated as zeroes.

Signed-off-by: Petr Malat <oss@malat.biz>
Link: https://lore.kernel.org/r/20200825150016.3363-1-oss@malat.biz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18 10:36:13 -04:00
Randy Dunlap
b483bb7719 ext4: delete duplicated words + other fixes
Delete repeated words in fs/ext4/.
{the, this, of, we, after}

Also change spelling of "xttr" in inline.c to "xattr" in 2 places.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200805024850.12129-1-rdunlap@infradead.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18 10:36:13 -04:00
Jens Axboe
766ef1e101 ext4: flag as supporting buffered async reads
ext4 uses generic_file_read_iter(), which already supports this.

Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Link: https://lore.kernel.org/r/fb90cc2d-b12c-738f-21a4-dd7a8ae0556a@kernel.dk
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18 10:36:12 -04:00
Eric Biggers
cb8d53d2c9 ext4: fix leaking sysfs kobject after failed mount
ext4_unregister_sysfs() only deletes the kobject.  The reference to it
needs to be put separately, like ext4_put_super() does.

This addresses the syzbot report
"memory leak in kobject_set_name_vargs (3)"
(https://syzkaller.appspot.com/bug?extid=9f864abad79fae7c17e1).

Reported-by: syzbot+9f864abad79fae7c17e1@syzkaller.appspotmail.com
Fixes: 72ba74508b ("ext4: release sysfs kobject when failing to enable quotas on mount")
Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20200922162456.93657-1-ebiggers@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18 10:36:12 -04:00
Jan Kara
5b3dc19dda ext4: discard preallocations before releasing group lock
ext4_mb_discard_group_preallocations() can be releasing group lock with
preallocations accumulated on its local list. Thus although
discard_pa_seq was incremented and concurrent allocating processes will
be retrying allocations, it can happen that premature ENOSPC error is
returned because blocks used for preallocations are not available for
reuse yet. Make sure we always free locally accumulated preallocations
before releasing group lock.

Fixes: 07b5b8e1ac ("ext4: mballoc: introduce pcpu seqcnt for freeing PA to improve ENOSPC handling")
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200924150959.4335-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18 10:36:12 -04:00
Ye Bin
70022da804 ext4: fix dead loop in ext4_mb_new_blocks
As we test disk offline/online with running fsstress, we find fsstress
process is keeping running state.
kworker/u32:3-262   [004] ...1   140.787471: ext4_mb_discard_preallocations: dev 8,32 needed 114
....
kworker/u32:3-262   [004] ...1   140.787471: ext4_mb_discard_preallocations: dev 8,32 needed 114

ext4_mb_new_blocks
repeat:
        ext4_mb_discard_preallocations_should_retry(sb, ac, &seq)
                freed = ext4_mb_discard_preallocations
                        ext4_mb_discard_group_preallocations
                                this_cpu_inc(discard_pa_seq);
                ---> freed == 0
                seq_retry = ext4_get_discard_pa_seq_sum
                        for_each_possible_cpu(__cpu)
                                __seq += per_cpu(discard_pa_seq, __cpu);
                if (seq_retry != *seq) {
                        *seq = seq_retry;
                        ret = true;
                }

As we see seq_retry is sum of discard_pa_seq every cpu, if
ext4_mb_discard_group_preallocations return zero discard_pa_seq in this
cpu maybe increase one, so condition "seq_retry != *seq" have always
been met.
Ritesh Harjani suggest to in ext4_mb_discard_group_preallocations function we
only increase discard_pa_seq when there is some PA to free.

Fixes: 07b5b8e1ac ("ext4: mballoc: introduce pcpu seqcnt for freeing PA to improve ENOSPC handling")
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20200916113859.1556397-3-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18 10:36:12 -04:00
Ritesh Harjani
0e6895ba00 ext4: implement swap_activate aops using iomap
After moving ext4's bmap to iomap interface, swapon functionality
on files created using fallocate (which creates unwritten extents) are
failing. This is since iomap_bmap interface returns 0 for unwritten
extents and thus generic_swapfile_activate considers this as holes
and hence bail out with below kernel msg :-

[340.915835] swapon: swapfile has holes

To fix this we need to implement ->swap_activate aops in ext4
which will use ext4_iomap_report_ops. Since we only need to return
the list of extents so ext4_iomap_report_ops should be enough.

Cc: stable@kernel.org
Reported-by: Yuxuan Shui <yshuiv7@gmail.com>
Fixes: ac58e4fb03 ("ext4: move ext4 bmap to use iomap infrastructure")
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20200904091653.1014334-1-riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18 10:35:54 -04:00
Linus Torvalds
a1b8638ba1 Linux 5.9-rc7 2020-09-27 14:38:10 -07:00
Linus Torvalds
16bc1d5432 Kbuild fixes for v5.9 (4th)
- Ignore compiler stubs for PPC to fix builds
 
  - Fix the usage of --target mentioned in the LLVM document
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAl9wyzAVHG1hc2FoaXJv
 eUBrZXJuZWwub3JnAAoJED2LAQed4NsGlGAQAIp9spj8Qd1U5lC7YVrTnYglG4rV
 BHPlNsBcV0V4i6OYunnEozO2wcyjii0WpOsxuA2o1ilDxoNSPPjPG73ySU2CNUYy
 ZMkuPER9vk/6MLFl4lyzxlSF2VkGem3Lk2M1yGvLKiiEAR6AafxZ6CdVig7kFz1H
 ZJV11ms9pufCpOVDMCnazS96I/EQn60uTfiy992Xjtkaet1tEaWh1KirH2OYAbBO
 oNQnlpHZVXp/xnexHz9z2ctMjWBGH2cDws5iDtIHTep2lYHK5i8CqOKzfKX88NRS
 txBZMJjDn/U825u+P8NHNaQsC1Vo0ECAsX/5YTjsLT2BZPG049BmTe0HNGDtVTB3
 bvSuRfG96L75qDxZusCtpk2TLKT7ntnV4bQYFXyilr2rZ0cXne10FmTOBhuEt1DQ
 r7o6/M4D2X2KpDTSIqfJRUozlyzBtmX9JBnKZIYvaC92ChhVch7b8RPeXV1IMqPS
 tCYpSszPbc/u15Bp2SPd7hZ1xVnKi+v8FzVRLJFLSHkWME61J/6x/v9c9jMu9buC
 y7vU1SNagkL2hf9YvsVWgiA5RvtYuj3o2uyMOZQiDE3h4RbxnSuM7IQKAH/hG4W0
 LE5hMB925Z5+H4gOduSC1pK9uxd6r6HpmJmNd4V2qEIqikNCfYIa6i0k/k3feZKE
 dJPqBX3wf8FsaJkU
 =kjmg
 -----END PGP SIGNATURE-----

Merge tag 'kbuild-fixes-v5.9-4' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild fixes from Masahiro Yamada:

 - ignore compiler stubs for PPC to fix builds

 - fix the usage of --target mentioned in the LLVM document

* tag 'kbuild-fixes-v5.9-4' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  Documentation/llvm: Fix clang target examples
  scripts/kallsyms: skip ppc compiler stub *.long_branch.* / *.plt_branch.*
2020-09-27 12:18:57 -07:00
Linus Torvalds
f8818559ca Two fixes for the x86 interrupt code:
- Unbreak the magic 'search the timer interrupt' logic in IO/APIC code
     which got wreckaged when the core interrupt code made the state
     tracking logic stricter. That caused the interrupt line to stay masked
     after switching from IO/APIC to PIC delivery mode, which obviously
     prevents interrupts from being delivered.
 
   - Make run_on_irqstack_code() typesafe. The function argument is a void
     pointer which is then casted to 'void (*fun)(void *). This breaks
     Control Flow Integrity checking in clang. Use proper helper functions
     for the three variants reuqired.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl9wqnATHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoWEHD/402XehvgOXySy/KuhMgtSAoJ/OElvo
 dnZYugCEh/6mllRbAhH0hkWpjxdhmCEYjWjf5Qqj01KYbQRSzYIhitmPyAI/Z2Dk
 CqhvRj4Hko7AQGmF97mg/VLdwn6sDIslfprgo3o0pqZcuyDd0QqUgkKLgQpvGDXe
 4l4OtiOoo2yHnr9wJiiMSVgHkXgp5QpTYAhnWV5ea9FokwO6/OLwjZAHgBh7MqNY
 6hnUQIHL/v4K3aXdxOED55Irf05Yk4OqiXKgNMHNLqd0xf+DBTqlfzvQ+omZpMXY
 sF0SML08r2f+KnBGuXlXDeu7MDRJlBO9Jxeq/mLcyLqP1K2nI2Ue3B6dy7AocKX3
 p1aUu0TUtRpDNdhMZSHFon1cQfPc8pFZSVkdq0ke2CYd+HleIrklvQzSn7gGrQqJ
 ZvWT9zs/XONXqxcMcSS1+XCQdJsRwIqbvY0Vxa68OIhZcKrsP7Ewln/sj46ghzOq
 kTbxXyG58anK2ssalJqR148tsDlZBrBlo3/Bm3sITIoAtb8m+jeuqi1fZnvmfkAC
 IiSEqY+p7CnNz9pYEA5T4GHDHE2luBA6YryclvCEAvi60XMZ+LW+T5szs6+fOjS+
 y4RK2dGgcE0ewlvm0jzWKLvKMdi7iv5c9ndtE1W4qxy/K/uXKvfHe45SpmcFAyqx
 FVjhT+kF/HKF1A==
 =BcVu
 -----END PGP SIGNATURE-----

Merge tag 'x86-urgent-2020-09-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Thomas Gleixner:
 "Two fixes for the x86 interrupt code:

   - Unbreak the magic 'search the timer interrupt' logic in IO/APIC
     code which got wreckaged when the core interrupt code made the
     state tracking logic stricter.

     That caused the interrupt line to stay masked after switching from
     IO/APIC to PIC delivery mode, which obviously prevents interrupts
     from being delivered.

   - Make run_on_irqstack_code() typesafe. The function argument is a
     void pointer which is then cast to 'void (*fun)(void *).

     This breaks Control Flow Integrity checking in clang. Use proper
     helper functions for the three variants reuqired"

* tag 'x86-urgent-2020-09-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/ioapic: Unbreak check_timer()
  x86/irq: Make run_on_irqstack_cond() typesafe
2020-09-27 12:15:21 -07:00
Linus Torvalds
ba25f0570b A set of clocksource/clockevents updates:
- Reset the TI/DM timer before enabling it instead of doing it the other
    way round.
 
  - Initialize the reload value for the GX6605s timer correctly so the
    hardware counter starts at 0 again after overrun.
 
  - Make error return value negative in the h8300 timer init function
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl9wqQ0THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoZx0EACJUIlCC54kw4CnZdxhoWu0f6tXEuip
 +Iyb8OJw56FdyHigvkPBMoF1o4a0Ax32TbYYOKntpDy67vnqkO6DV1M/Mwt8IhfO
 ey7h1t7e4y2vrXAfYN0oX1ZQAk9hkPGW5+wugEf6dbZZva7mm+jV0PfNP/yn7KWS
 n9lUrLNlPJdndSIYwj9Cto5mMQBsM7/qM8MkBR84i8GxFP2rofh4C5bD8WTnXzHd
 B8898riwkaaQmfq/Ch9Y79oMzpZXysAEYpZ3YExkQsEmi5YqZ8k6R8RD18mKQdFH
 7Kqh/025j7oKk9fopOvPjZ9sIX22gGP8C+tdy3sipYDCY0wRVNu+SPXppwl0T9ML
 JLX/D2pC20f/VUQ21yc8KgVt76g8QID4t+NV5/VdIHuxhei/4WN3hJxuI4w4Ivfn
 YK8mB5TK+R4K8Ln+GFE0zh/wfpjJe84K7r4NmDJnClD8chTVhVZHOlv5qJBZzob8
 Yd4fMFS0WufAj15ZMN55iLFEI30iubY5X1xaDD1sFrFJyO1VCj8ITH7mBtW9zW1a
 a/8LQlB5yIjLNTGVZGTCcYfyQ7+MA1EmkutD7AnFN87Zwx6FtDYGEPZq/KI3dwrw
 2qA7HTVBYoWQvSOQWt8inuXsbnqUQ2Hq2y8cIuieg333OGc1WQS6BZOeLdJWNGas
 W0JztaeFr1S3ew==
 =Htin
 -----END PGP SIGNATURE-----

Merge tag 'timers-urgent-2020-09-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer updates from Thomas Gleixner:
 "A set of clocksource/clockevents updates:

   - Reset the TI/DM timer before enabling it instead of doing it the
     other way round.

   - Initialize the reload value for the GX6605s timer correctly so the
     hardware counter starts at 0 again after overrun.

   - Make error return value negative in the h8300 timer init function"

* tag 'timers-urgent-2020-09-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  clocksource/drivers/timer-gx6605s: Fixup counter reload
  clocksource/drivers/timer-ti-dm: Do reset before enable
  clocksource/drivers/h8300_timer8: Fix wrong return value in h8300_8timer_init()
2020-09-27 12:11:35 -07:00
Peter Xu
d042035eaf mm/thp: Split huge pmds/puds if they're pinned when fork()
Pinned pages shouldn't be write-protected when fork() happens, because
follow up copy-on-write on these pages could cause the pinned pages to
be replaced by random newly allocated pages.

For huge PMDs, we split the huge pmd if pinning is detected.  So that
future handling will be done by the PTE level (with our latest changes,
each of the small pages will be copied).  We can achieve this by let
copy_huge_pmd() return -EAGAIN for pinned pages, so that we'll
fallthrough in copy_pmd_range() and finally land the next
copy_pte_range() call.

Huge PUDs will be even more special - so far it does not support
anonymous pages.  But it can actually be done the same as the huge PMDs
even if the split huge PUDs means to erase the PUD entries.  It'll
guarantee the follow up fault ins will remap the same pages in either
parent/child later.

This might not be the most efficient way, but it should be easy and
clean enough.  It should be fine, since we're tackling with a very rare
case just to make sure userspaces that pinned some thps will still work
even without MADV_DONTFORK and after they fork()ed.

Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-27 11:21:35 -07:00
Peter Xu
70e806e4e6 mm: Do early cow for pinned pages during fork() for ptes
This allows copy_pte_range() to do early cow if the pages were pinned on
the source mm.

Currently we don't have an accurate way to know whether a page is pinned
or not.  The only thing we have is page_maybe_dma_pinned().  However
that's good enough for now.  Especially, with the newly added
mm->has_pinned flag to make sure we won't affect processes that never
pinned any pages.

It would be easier if we can do GFP_KERNEL allocation within
copy_one_pte().  Unluckily, we can't because we're with the page table
locks held for both the parent and child processes.  So the page
allocation needs to be done outside copy_one_pte().

Some trick is there in copy_present_pte(), majorly the wrprotect trick
to block concurrent fast-gup.  Comments in the function should explain
better in place.

Oleg Nesterov reported a (probably harmless) bug during review that we
didn't reset entry.val properly in copy_pte_range() so that potentially
there's chance to call add_swap_count_continuation() multiple times on
the same swp entry.  However that should be harmless since even if it
happens, the same function (add_swap_count_continuation()) will return
directly noticing that there're enough space for the swp counter.  So
instead of a standalone stable patch, it is touched up in this patch
directly.

Link: https://lore.kernel.org/lkml/20200914143829.GA1424636@nvidia.com/
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-27 11:21:35 -07:00
Peter Xu
7a4830c380 mm/fork: Pass new vma pointer into copy_page_range()
This prepares for the future work to trigger early cow on pinned pages
during fork().

No functional change intended.

Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-27 11:21:35 -07:00
Peter Xu
008cfe4418 mm: Introduce mm_struct.has_pinned
(Commit message majorly collected from Jason Gunthorpe)

Reduce the chance of false positive from page_maybe_dma_pinned() by
keeping track if the mm_struct has ever been used with pin_user_pages().
This allows cases that might drive up the page ref_count to avoid any
penalty from handling dma_pinned pages.

Future work is planned, to provide a more sophisticated solution, likely
to turn it into a real counter.  For now, make it atomic_t but use it as
a boolean for simplicity.

Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-27 11:21:35 -07:00