Commit Graph

557 Commits

Author SHA1 Message Date
Yu Kuai f423f41b76 raid1: fix use-after-free for original bio in raid1_write_request()
commit fcf3f7e2fc upstream.

r1_bio->bios[] is used to record new bios that will be issued to
underlying disks, however, in raid1_write_request(), r1_bio->bios[]
will set to the original bio temporarily. Meanwhile, if blocked rdev
is set, free_r1bio() will be called causing that all r1_bio->bios[]
to be freed:

raid1_write_request()
 r1_bio = alloc_r1bio(mddev, bio); -> r1_bio->bios[] is NULL
 for (i = 0;  i < disks; i++) -> for each rdev in conf
  // first rdev is normal
  r1_bio->bios[0] = bio; -> set to original bio
  // second rdev is blocked
  if (test_bit(Blocked, &rdev->flags))
   break

 if (blocked_rdev)
  free_r1bio()
   put_all_bios()
    bio_put(r1_bio->bios[0]) -> original bio is freed

Test scripts:

mdadm -CR /dev/md0 -l1 -n4 /dev/sd[abcd] --assume-clean
fio -filename=/dev/md0 -ioengine=libaio -rw=write -bs=4k -numjobs=1 \
    -iodepth=128 -name=test -direct=1
echo blocked > /sys/block/md0/md/rd2/state

Test result:

BUG bio-264 (Not tainted): Object already free
-----------------------------------------------------------------------------

Allocated in mempool_alloc_slab+0x24/0x50 age=1 cpu=1 pid=869
 kmem_cache_alloc+0x324/0x480
 mempool_alloc_slab+0x24/0x50
 mempool_alloc+0x6e/0x220
 bio_alloc_bioset+0x1af/0x4d0
 blkdev_direct_IO+0x164/0x8a0
 blkdev_write_iter+0x309/0x440
 aio_write+0x139/0x2f0
 io_submit_one+0x5ca/0xb70
 __do_sys_io_submit+0x86/0x270
 __x64_sys_io_submit+0x22/0x30
 do_syscall_64+0xb1/0x210
 entry_SYSCALL_64_after_hwframe+0x6c/0x74
Freed in mempool_free_slab+0x1f/0x30 age=1 cpu=1 pid=869
 kmem_cache_free+0x28c/0x550
 mempool_free_slab+0x1f/0x30
 mempool_free+0x40/0x100
 bio_free+0x59/0x80
 bio_put+0xf0/0x220
 free_r1bio+0x74/0xb0
 raid1_make_request+0xadf/0x1150
 md_handle_request+0xc7/0x3b0
 md_submit_bio+0x76/0x130
 __submit_bio+0xd8/0x1d0
 submit_bio_noacct_nocheck+0x1eb/0x5c0
 submit_bio_noacct+0x169/0xd40
 submit_bio+0xee/0x1d0
 blkdev_direct_IO+0x322/0x8a0
 blkdev_write_iter+0x309/0x440
 aio_write+0x139/0x2f0

Since that bios for underlying disks are not allocated yet, fix this
problem by using mempool_free() directly to free the r1_bio.

Fixes: 992db13a4a ("md/raid1: free the r1bio before waiting for blocked rdev")
Cc: stable@vger.kernel.org # v6.6+
Reported-by: Coly Li <colyli@suse.de>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Tested-by: Coly Li <colyli@suse.de>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240308093726.1047420-1-yukuai1@huaweicloud.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-04-17 11:23:24 +02:00
Yu Kuai 10b6e8412d md/raid1: fix choose next idle in read_balance()
[ Upstream commit 257ac239ff ]

Commit 12cee5a8a2 ("md/raid1: prevent merging too large request") add
the case choose next idle in read_balance():

read_balance:
 for_each_rdev
  if(next_seq_sect == this_sector || dist == 0)
  -> sequential reads
   best_disk = disk;
   if (...)
    choose_next_idle = 1
    continue;

 for_each_rdev
 -> iterate next rdev
  if (pending == 0)
   best_disk = disk;
   -> choose the next idle disk
   break;

  if (choose_next_idle)
   -> keep using this rdev if there are no other idle disk
   contine

However, commit 2e52d449bc ("md/raid1: add failfast handling for reads.")
remove the code:

-               /* If device is idle, use it */
-               if (pending == 0) {
-                       best_disk = disk;
-                       break;
-               }

Hence choose next idle will never work now, fix this problem by
following:

1) don't set best_disk in this case, read_balance() will choose the best
   disk after iterating all the disks;
2) add 'pending' so that other idle disk will be chosen;
3) add a new local variable 'sequential_disk' to record the disk, and if
   there is no other idle disk, 'sequential_disk' will be chosen;

Fixes: 2e52d449bc ("md/raid1: add failfast handling for reads.")
Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240229095714.926789-5-yukuai1@huaweicloud.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-03-26 18:16:29 -04:00
Yu Kuai cb38ae9ed8 md/raid1: record nonrot rdevs while adding/removing rdevs to conf
[ Upstream commit 2c27d09d3a ]

For raid1, each read will iterate all the rdevs from conf and check if
any rdev is non-rotational, then choose rdev with minimal IO inflight
if so, or rdev with closest distance otherwise.

Disk nonrot info can be changed through sysfs entry:

/sys/block/[disk_name]/queue/rotational

However, consider that this should only be used for testing, and user
really shouldn't do this in real life. Record the number of non-rotational
disks in conf, to avoid checking each rdev in IO fast path and simplify
read_balance() a little bit.

Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240229095714.926789-4-yukuai1@huaweicloud.com
Stable-dep-of: 257ac239ff ("md/raid1: fix choose next idle in read_balance()")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-03-26 18:16:28 -04:00
Yu Kuai 9732ce695e md/raid1: factor out helpers to add rdev to conf
[ Upstream commit 969d6589ab ]

There are no functional changes, just make code cleaner and prepare to
record disk non-rotational information while adding and removing rdev to
conf

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240229095714.926789-3-yukuai1@huaweicloud.com
Stable-dep-of: 257ac239ff ("md/raid1: fix choose next idle in read_balance()")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-03-26 18:16:28 -04:00
Mikulas Patocka 9f3fe29d77 md: fix a suspicious RCU usage warning
RCU protection was removed in the commit 2d32777d60 ("raid1: remove rcu
protection to access rdev from conf").

However, the code in fix_read_error does rcu_dereference outside
rcu_read_lock - this triggers the following warning. The warning is
triggered by a LVM2 test shell/integrity-caching.sh.

This commit removes rcu_dereference.

=============================
WARNING: suspicious RCU usage
6.7.0 #2 Not tainted
-----------------------------
drivers/md/raid1.c:2265 suspicious rcu_dereference_check() usage!

other info that might help us debug this:

rcu_scheduler_active = 2, debug_locks = 1
no locks held by mdX_raid1/1859.

stack backtrace:
CPU: 2 PID: 1859 Comm: mdX_raid1 Not tainted 6.7.0 #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0x60/0x70
 lockdep_rcu_suspicious+0x153/0x1b0
 raid1d+0x1732/0x1750 [raid1]
 ? lock_acquire+0x9f/0x270
 ? finish_wait+0x3d/0x80
 ? md_thread+0xf7/0x130 [md_mod]
 ? lock_release+0xaa/0x230
 ? md_register_thread+0xd0/0xd0 [md_mod]
 md_thread+0xa0/0x130 [md_mod]
 ? housekeeping_test_cpu+0x30/0x30
 kthread+0xdc/0x110
 ? kthread_complete_and_exit+0x20/0x20
 ret_from_fork+0x28/0x40
 ? kthread_complete_and_exit+0x20/0x20
 ret_from_fork_asm+0x11/0x20
 </TASK>

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: ca294b34aa ("md/raid1: support read error check")
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/51539879-e1ca-fde3-b8b4-8934ddedcbc@redhat.com
2024-01-24 22:58:00 -08:00
Bart Van Assche 7dab24554d md/raid1: Use blk_opf_t for read and write operations
Use the type blk_opf_t for read and write operations instead of int. This
patch does not affect the generated code but fixes the following sparse
warning:

drivers/md/raid1.c:1993:60: sparse: sparse: incorrect type in argument 5 (different base types)
     expected restricted blk_opf_t [usertype] opf
     got int rw

Cc: Song Liu <song@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Fixes: 3c5e514db5 ("md/raid1: Use the new blk_opf_t type")
Cc: stable@vger.kernel.org # v6.0+
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202401080657.UjFnvQgX-lkp@intel.com/
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20240108001223.23835-1-bvanassche@acm.org
2024-01-09 15:14:01 -08:00
Li Nan ca294b34aa md/raid1: support read error check
After commit 1e50915fe0 ("raid: improve MD/raid10 handling of correctable
read errors."), rdev will be set to faulty if it reads data error to many
times in raid10. Add this mechanism to raid1 now.

Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231215023852.3478228-3-linan666@huaweicloud.com
2023-12-15 15:22:15 -08:00
Li Nan 1979dbbe32 md: factor out a helper exceed_read_errors() to check read_errors
Move check_decay_read_errors() to raid1-10.c and factor out a helper
exceed_read_errors() to check if read_errors exceeds the limit, so that
raid1 can also use it. There are no functional changes.

Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231215023852.3478228-2-linan666@huaweicloud.com
2023-12-15 15:22:15 -08:00
Gou Hao af140f806a md/raid1: remove unnecessary null checking
If %__GFP_DIRECT_RECLAIM is set then bio_alloc_bioset will always
be able to allocate a bio. See comment of bio_alloc_bioset.

Signed-off-by: Gou Hao <gouhao@uniontech.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231214151458.28970-1-gouhao@uniontech.com
2023-12-15 00:45:37 -08:00
Yu Kuai 2d32777d60 md/raid1: remove rcu protection to access rdev from conf
Because it's safe to accees rdev from conf:
 - If any spinlock is held, because synchronize_rcu() from
   md_kick_rdev_from_array() will prevent 'rdev' to be freed until
   spinlock is released;
 - If 'reconfig_lock' is held, because rdev can't be added or removed from
   array;
 - If there is normal IO inflight, because mddev_suspend() will prevent
   rdev to be added or removed from array;
 - If there is sync IO inflight, because 'MD_RECOVERY_RUNNING' is
   checked in remove_and_add_spares().

And these will cover all the scenarios in raid1.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231125081604.3939938-4-yukuai1@huaweicloud.com
2023-11-27 15:49:05 -08:00
Yu Kuai c891f1fd90 md: remove flag RemoveSynchronized
rcu is not used correctly here, because synchronize_rcu() is called
before replacing old value, for example:

remove_and_add_spares   // other path
 synchronize_rcu
 // called before replacing old value
 set_bit(RemoveSynchronized)
                        rcu_read_lock()
                        rdev = conf->mirros[].rdev
 pers->hot_remove_disk
  conf->mirros[].rdev = NULL;
  if (!test_bit(RemoveSynchronized))
   synchronize_rcu
   /*
    * won't be called, and won't wait
    * for concurrent readers to be done.
    */
                        // access rdev after remove_and_add_spares()
                        rcu_read_unlock()

Fortunately, there is a separate rcu protection to prevent such rdev
to be freed:

md_kick_rdev_from_array		//other path
				rcu_read_lock()
				rdev = conf->mirros[].rdev
list_del_rcu(&rdev->same_set)

				rcu_read_unlock()
				/*
				 * rdev can be removed from conf, but
				 * rdev won't be freed.
				 */
synchronize_rcu()
free rdev

Hence remove this useless flag and prepare to remove rcu protection to
access rdev from 'conf'.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231125081604.3939938-2-yukuai1@huaweicloud.com
2023-11-27 15:49:04 -08:00
Yu Kuai 9e55a22fce md/raid1: don't split discard io for write behind
Currently, discad io is treated the same as normal write io, and for
write behind case, io size is limited to:

BIO_MAX_VECS * (PAGE_SIZE >> 9)

For 0.5KB sector size and 4KB PAGE_SIZE, this is just 1MB. For
consequence, if 'WriteMostly' is set to one of the underlying disks,
then diskcard io will be splited into 1MB and it will take a long time
for the diskcard to finish.

Fix this problem by disable write behind for discard io.

Reported-by: Roman Mamedov <rm@romanrm.net>
Closes: https://lore.kernel.org/all/6a1165f7-c792-c054-b8f0-1ad4f7b8ae01@ultracoder.org/
Reported-and-tested-by: Kirill Kirilenko <kirill@ultracoder.org>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231007112105.407449-1-yukuai1@huaweicloud.com
2023-10-09 16:32:02 -07:00
Yu Kuai b8494823e2 md: initialize 'writes_pending' while allocating mddev
Currently 'writes_pending' is initialized in pers->run for raid1/5/10,
and it's freed while deleing mddev, instead of pers->free. pers->run can
be called multiple times before mddev is deleted, and a helper
mddev_init_writes_pending() is used to prevent 'writes_pending' to be
initialized multiple times, this usage is safe but a litter weird.

On the other hand, 'writes_pending' is only initialized for raid1/5/10,
however, it's used in common layer, for example:

array_state_store
 set_in_sync
  if (!mddev->in_sync) -> in_sync is used for all levels
   // access writes_pending

There might be some implicit dependency that I don't recognized to make
sure 'writes_pending' can only be accessed for raid1/5/10, but there are
no comments about that.

By the way, it make sense to initialize 'writes_pending' in common layer
because there are already three levels use it.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230825030956.1527023-3-yukuai1@huaweicloud.com
2023-09-22 10:28:26 -07:00
Nigel Croxon df203da47f md/raid1: fix error: ISO C90 forbids mixed declarations
There is a compile error when this commit is added:
md: raid1: fix potential OOB in raid1_remove_disk()

drivers/md/raid1.c: In function 'raid1_remove_disk':
drivers/md/raid1.c:1844:9: error: ISO C90 forbids mixed declarations
and code [-Werror=declaration-after-statement]
1844 |         struct raid1_info *p = conf->mirrors + number;
     |         ^~~~~~

That's because the new code was inserted before the struct.
The change is move the struct command above this commit.

Fixes: 8b0472b50b ("md: raid1: fix potential OOB in raid1_remove_disk()")
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/46d929d0-2aab-4cf2-b2bf-338963e8ba5a@redhat.com
2023-09-11 14:41:58 -07:00
Heinz Mauelshagen 6b2460e66c md raid1: allow writebehind to work on any leg device set WriteMostly
As the WriteMostly flag can be set on any component device of a RAID1
array, remove the constraint that it only works if set on the first one.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Tested-by: Xiao Ni <xni@redhat.com>
Link: https://lore.kernel.org/r/2a9592bf3340f34bf588eec984b23ee219f3985e.1692013451.git.heinzm@redhat.com
Signed-off-by: Song Liu <song@kernel.org>
2023-08-17 21:11:31 -07:00
Xueshi Hu c069da449a md/raid1: hold the barrier until handle_read_error() finishes
handle_read_error() will call allow_barrier() to match the former barrier
raising. However, it should put the allow_barrier() at the end to avoid a
concurrent raid reshape.

Fixes: 689389a06c ("md/raid1: simplify handle_read_error().")
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xueshi Hu <xueshi.hu@smartx.com>
Link: https://lore.kernel.org/r/20230814135356.1113639-4-xueshi.hu@smartx.com
Signed-off-by: Song Liu <song@kernel.org>
2023-08-17 21:11:30 -07:00
Xueshi Hu 992db13a4a md/raid1: free the r1bio before waiting for blocked rdev
Raid1 reshape will change mempool and r1conf::raid_disks which are
needed to free r1bio. allow_barrier() make a concurrent raid1_reshape()
possible. So, free the in-flight r1bio before waiting blocked rdev.

Fixes: 6bfe0b4990 ("md: support blocking writes to an array on device failure")
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xueshi Hu <xueshi.hu@smartx.com>
Link: https://lore.kernel.org/r/20230814135356.1113639-3-xueshi.hu@smartx.com
Signed-off-by: Song Liu <song@kernel.org>
2023-08-17 21:11:30 -07:00
Xueshi Hu c5d736f548 md/raid1: call free_r1bio() before allow_barrier() in raid_end_bio_io()
After allow_barrier, a concurrent raid1_reshape() will replace old mempool
and r1conf::raid_disks. Move allow_barrier() to the end of raid_end_bio_io(),
so that r1bio can be freed safely.

Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xueshi Hu <xueshi.hu@smartx.com>
Link: https://lore.kernel.org/r/20230814135356.1113639-2-xueshi.hu@smartx.com
Signed-off-by: Song Liu <song@kernel.org>
2023-08-17 21:11:30 -07:00
Li Lingfeng 7eb8ff02c1 md: Hold mddev->reconfig_mutex when trying to get mddev->sync_thread
Commit ba9d9f1a707f ("Revert "md: unlock mddev before reap sync_thread in
action_store"") removed the scenario of calling md_unregister_thread()
without holding mddev->reconfig_mutex, so add a lock holding check before
acquiring mddev->sync_thread by passing mdev to md_unregister_thread().

Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20230803071711.2546560-1-lilingfeng@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
2023-08-15 09:40:26 -07:00
Zhang Shurong 8b0472b50b md: raid1: fix potential OOB in raid1_remove_disk()
If rddev->raid_disk is greater than mddev->raid_disks, there will be
an out-of-bounds in raid1_remove_disk(). We have already found
similar reports as follows:

1) commit d17f744e88 ("md-raid10: fix KASAN warning")
2) commit 1ebc2cec0b ("dm raid: fix KASAN warning in raid5_remove_disk")

Fix this bug by checking whether the "number" variable is
valid.

Signed-off-by: Zhang Shurong <zhang_shurong@foxmail.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/tencent_0D24426FAC6A21B69AC0C03CE4143A508F09@qq.com
Signed-off-by: Song Liu <song@kernel.org>
2023-08-15 09:37:27 -07:00
Jack Wang 21bd9a68fe md/raid1: Avoid lock contention from wake_up()
wake_up is called unconditionally in a few paths such as make_request(),
which cause lock contention under high concurrency workload like below
    raid1_end_write_request
     wake_up
      __wake_up_common_lock
       spin_lock_irqsave

Improve performance by only call wake_up() if waitqueue is not empty

Fio test script:

[global]
name=random reads and writes
ioengine=libaio
direct=1
readwrite=randrw
rwmixread=70
iodepth=64
buffered=0
filename=/dev/md0
size=1G
runtime=30
time_based
randrepeat=0
norandommap
refill_buffers
ramp_time=10
bs=4k
numjobs=400
group_reporting=1
[job1]

Test result with 2 ramdisk in raid1 on a Intel Broadwell 56 cores server.

	Before this patch       With this patch
	READ	BW=4621MB/s	BW=7337MB/s
	WRITE	BW=1980MB/s	BW=3144MB/s

The patch is inspired by Yu Kuai's change for raid10:
https://lore.kernel.org/r/20230621105728.1268542-1-yukuai1@huaweicloud.com

Cc: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20230705113227.148494-1-jinpu.wang@ionos.com
Signed-off-by: Song Liu <song@kernel.org>
2023-07-27 00:13:30 -07:00
Li Nan 02c67a3b72 md: remove redundant check in fix_read_error()
In fix_read_error(), 'success' will be checked immediately after assigning
it, if it is set to 1 then the loop will break. Checking it again in
condition of loop is redundant. Clean it up.

Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20230623173236.2513554-3-linan666@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
2023-07-27 00:13:30 -07:00
Li Nan ffb1e7a03f md/raid1: prioritize adding disk to 'removed' mirror
New disk should be added to "removed" position first instead of to be a
replacement. Commit 6090368abc ("md/raid10: prioritize adding disk to
'removed' mirror") has fixed this issue for raid10. Fix it for raid1 now.

Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20230627014332.3810102-1-linan666@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
2023-07-27 00:13:30 -07:00
Yu Kuai bb2a9acefa md/raid1: switch to use md_account_bio() for io accounting
Two problems can be fixed this way:

1) 'active_io' will represent inflight io instead of io that is
dispatching.

2) If io accounting is enabled or disabled while io is still inflight,
bio_start_io_acct() and bio_end_io_acct() is not balanced and io
inflight counter will be leaked.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230621165110.1498313-5-yukuai1@huaweicloud.com
2023-07-27 00:13:29 -07:00
Yu Kuai 460af1f9d9 md/raid1-10: limit the number of plugged bio
bio can be added to plug infinitely, and following writeback test can
trigger huge amount of plugged bio:

Test script:
modprobe brd rd_nr=4 rd_size=10485760
mdadm -CR /dev/md0 -l10 -n4 /dev/ram[0123] --assume-clean --bitmap=internal
echo 0 > /proc/sys/vm/dirty_background_ratio
fio -filename=/dev/md0 -ioengine=libaio -rw=write -bs=4k -numjobs=1 -iodepth=128 -name=test

Test result:
Monitor /sys/block/md0/inflight will found that inflight keep increasing
until fio finish writing, after running for about 2 minutes:

[root@fedora ~]# cat /sys/block/md0/inflight
       0  4474191

Fix the problem by limiting the number of plugged bio based on the number
of copies for original bio.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529131106.2123367-8-yukuai1@huaweicloud.com
2023-06-13 15:25:44 -07:00
Yu Kuai 9efcc2c3df md/raid1-10: don't handle pluged bio by daemon thread
current->bio_list will be set under submit_bio() context, in this case
bitmap io will be added to the list and wait for current io submission to
finish, while current io submission must wait for bitmap io to be done.
commit 874807a831 ("md/raid1{,0}: fix deadlock in bitmap_unplug.") fix
the deadlock by handling plugged bio by daemon thread.

On the one hand, the deadlock won't exist after commit a214b949d8
("blk-mq: only flush requests from the plug in blk_mq_submit_bio"). On
the other hand, current solution makes it impossible to flush plugged bio
in raid1/10_make_request(), because this will cause that all the writes
will goto daemon thread.

In order to limit the number of plugged bio, commit 874807a831
("md/raid1{,0}: fix deadlock in bitmap_unplug.") is reverted, and the
deadlock is fixed by handling bitmap io asynchronously.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529131106.2123367-7-yukuai1@huaweicloud.com
2023-06-13 15:25:44 -07:00
Yu Kuai 8295efbe68 md/raid1-10: factor out a helper to submit normal write
There are multiple places to do the same thing, factor out a helper to
prevent redundant code, and the helper will be used in following patch
as well.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529131106.2123367-4-yukuai1@huaweicloud.com
2023-06-13 15:25:43 -07:00
Yu Kuai 5ec6ca140a md/raid1-10: factor out a helper to add bio to plug
The code in raid1 and raid10 is identical, prepare to limit the number
of plugged bios.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529131106.2123367-3-yukuai1@huaweicloud.com
2023-06-13 15:25:43 -07:00
Yu Kuai 4469315439 md: protect md_thread with rcu
Currently, there are many places that md_thread can be accessed without
protection, following are known scenarios that can cause
null-ptr-dereference or uaf:

1) sync_thread that is allocated and started from md_start_sync()
2) mddev->thread can be accessed directly from timeout_store() and
   md_bitmap_daemon_work()
3) md_unregister_thread() from action_store().

Currently, a global spinlock 'pers_lock' is borrowed to protect
'mddev->thread' in some places, this problem can be fixed likewise,
however, use a global lock for all the cases is not good.

Fix this problem by protecting all md_thread with rcu.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230523021017.3048783-6-yukuai1@huaweicloud.com
2023-06-13 15:25:39 -07:00
Johannes Thumshirn f83123223a md: raid1: use __bio_add_page for adding single page to bio
The sync request code uses bio_add_page() to add a page to a newly created bio.
bio_add_page() can fail, but the return value is never checked.

Use __bio_add_page() as adding a single page to a newly created bio is
guaranteed to succeed.

This brings us a step closer to marking bio_add_page() as __must_check.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/6cf7f66c6e646231200d025dfd5f2d3ae75c8fe5.1685532726.git.johannes.thumshirn@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-01 09:13:31 -06:00
Johannes Thumshirn b42473cdba md: check for failure when adding pages in alloc_behind_master_bio
alloc_behind_master_bio() can possibly add multiple pages to a bio, but it
is not checking for the return value of bio_add_page() if adding really
succeeded.

Check if the page adding succeeded and if not bail out.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Song Liu <song@kernel.org>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/827aa12d44ebf3f50b41b47f5cedc0f80179f2c1.1685532726.git.johannes.thumshirn@wdc.com
[axboe: fold in s/free_page/put_page fix]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-01 09:13:13 -06:00
Christoph Hellwig c34b7ac650 block: remove bio_set_op_attrs
This macro is obsolete, so replace the last few uses with open coded
bi_opf assignments.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Coly Li <colyli@suse.de <mailto:colyli@suse.de>>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20221206144057.720846-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-07 09:43:12 -07:00
Jiang Li b611ad1400 md/raid1: stop mdx_raid1 thread when raid1 array run failed
fail run raid1 array when we assemble array with the inactive disk only,
but the mdx_raid1 thread were not stop, Even if the associated resources
have been released. it will caused a NULL dereference when we do poweroff.

This causes the following Oops:
    [  287.587787] BUG: kernel NULL pointer dereference, address: 0000000000000070
    [  287.594762] #PF: supervisor read access in kernel mode
    [  287.599912] #PF: error_code(0x0000) - not-present page
    [  287.605061] PGD 0 P4D 0
    [  287.607612] Oops: 0000 [#1] SMP NOPTI
    [  287.611287] CPU: 3 PID: 5265 Comm: md0_raid1 Tainted: G     U            5.10.146 #0
    [  287.619029] Hardware name: xxxxxxx/To be filled by O.E.M, BIOS 5.19 06/16/2022
    [  287.626775] RIP: 0010:md_check_recovery+0x57/0x500 [md_mod]
    [  287.632357] Code: fe 01 00 00 48 83 bb 10 03 00 00 00 74 08 48 89 ......
    [  287.651118] RSP: 0018:ffffc90000433d78 EFLAGS: 00010202
    [  287.656347] RAX: 0000000000000000 RBX: ffff888105986800 RCX: 0000000000000000
    [  287.663491] RDX: ffffc90000433bb0 RSI: 00000000ffffefff RDI: ffff888105986800
    [  287.670634] RBP: ffffc90000433da0 R08: 0000000000000000 R09: c0000000ffffefff
    [  287.677771] R10: 0000000000000001 R11: ffffc90000433ba8 R12: ffff888105986800
    [  287.684907] R13: 0000000000000000 R14: fffffffffffffe00 R15: ffff888100b6b500
    [  287.692052] FS:  0000000000000000(0000) GS:ffff888277f80000(0000) knlGS:0000000000000000
    [  287.700149] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [  287.705897] CR2: 0000000000000070 CR3: 000000000320a000 CR4: 0000000000350ee0
    [  287.713033] Call Trace:
    [  287.715498]  raid1d+0x6c/0xbbb [raid1]
    [  287.719256]  ? __schedule+0x1ff/0x760
    [  287.722930]  ? schedule+0x3b/0xb0
    [  287.726260]  ? schedule_timeout+0x1ed/0x290
    [  287.730456]  ? __switch_to+0x11f/0x400
    [  287.734219]  md_thread+0xe9/0x140 [md_mod]
    [  287.738328]  ? md_thread+0xe9/0x140 [md_mod]
    [  287.742601]  ? wait_woken+0x80/0x80
    [  287.746097]  ? md_register_thread+0xe0/0xe0 [md_mod]
    [  287.751064]  kthread+0x11a/0x140
    [  287.754300]  ? kthread_park+0x90/0x90
    [  287.757974]  ret_from_fork+0x1f/0x30

In fact, when raid1 array run fail, we need to do
md_unregister_thread() before raid1_free().

Signed-off-by: Jiang Li <jiang.li@ugreen.com>
Signed-off-by: Song Liu <song@kernel.org>
2022-11-14 10:15:35 -08:00
Bart Van Assche 3c5e514db5 md/raid1: Use the new blk_opf_t type
Improve static type checking by using the new blk_opf_t type for
variables that represent request flags.

Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220714180729.1065367-35-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-14 12:14:31 -06:00
Bart Van Assche 4ce4c73f66 md/core: Combine two sync_page_io() arguments
Improve uniformity in the kernel of handling of request operation and
flags by passing these as a single argument.

Cc: Song Liu <song@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220714180729.1065367-32-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-14 12:14:31 -06:00
Christoph Hellwig 900d156bac block: remove bdevname
Replace the remaining calls of bdevname with snprintf using the %pg
format specifier.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220713055317.1888500-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-14 10:27:56 -06:00
Christoph Hellwig 913cce5a1e md: remove most calls to bdevname
Use the %pg format specifier to save on stack consumption and code size.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
2022-05-22 23:07:21 -07:00
Mariusz Tkaczyk 9631abdbf4 md: Set MD_BROKEN for RAID1 and RAID10
There is no direct mechanism to determine raid failure outside
personality. It is done by checking rdev->flags after executing
md_error(). If "faulty" flag is not set then -EBUSY is returned to
userspace. -EBUSY means that array will be failed after drive removal.

Mdadm has special routine to handle the array failure and it is executed
if -EBUSY is returned by md.

There are at least two known reasons to not consider this mechanism
as correct:
1. drive can be removed even if array will be failed[1].
2. -EBUSY seems to be wrong status. Array is not busy, but removal
   process cannot proceed safe.

-EBUSY expectation cannot be removed without breaking compatibility
with userspace. In this patch first issue is resolved by adding support
for MD_BROKEN flag for RAID1 and RAID10. Support for RAID456 is added in
next commit.

The idea is to set the MD_BROKEN if we are sure that raid is in failed
state now. This is done in each error_handler(). In md_error() MD_BROKEN
flag is checked. If is set, then -EBUSY is returned to userspace.

As in previous commit, it causes that #mdadm --set-faulty is able to
fail array. Previously proposed workaround is valid if optional
functionality[1] is disabled.

[1] commit 9a567843f7ce("md: allow last device to be forcibly removed from
    RAID1/RAID10.")

Reviewd-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
Signed-off-by: Song Liu <song@kernel.org>
2022-04-25 14:00:34 -07:00
Christoph Hellwig 70200574cc block: remove QUEUE_FLAG_DISCARD
Just use a non-zero max_discard_sectors as an indicator for discard
support, similar to what is done for write zeroes.

The only places where needs special attention is the RAID5 driver,
which must clear discard support for security reasons by default,
even if the default stacking rules would allow for it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> [drbd]
Acked-by: Jan Höppner <hoeppner@linux.ibm.com> [s390]
Acked-by: Coly Li <colyli@suse.de> [bcache]
Acked-by: David Sterba <dsterba@suse.com> [btrfs]
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220415045258.199825-25-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-17 19:49:59 -06:00
Christoph Hellwig 10f0d2a517 block: add a bdev_nonrot helper
Add a helper to check the nonrot flag based on the block_device instead
of having to poke into the block layer internal request_queue.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: David Sterba <dsterba@suse.com> [btrfs]
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220415045258.199825-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-17 19:49:59 -06:00
Christoph Hellwig 066ff57101 block: turn bio_kmalloc into a simple kmalloc wrapper
Remove the magic autofree semantics and require the callers to explicitly
call bio_init to initialize the bio.

This allows bio_free to catch accidental bio_put calls on bio_init()ed
bios as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Coly Li <colyli@suse.de>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Link: https://lore.kernel.org/r/20220406061228.410163-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-17 19:30:41 -06:00
Linus Torvalds 561593a048 for-5.18/write-streams-2022-03-18
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmI1AHwQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgplPjEACVJzKg5NkxpdkDThvq5tejws9KxB/4mHJg
 NoDMcv1TF+Orsd/HNW6XrgYnbU0ObHom3568/xb8BNegRVFe7V4ME/4IYNRyGOmV
 qbfciu04L1UkJhI52CIidkOioFABL3r1zgLCIz5vk0Cv9X7Le9x0UabHxJf7u9S+
 Z3lNdyxezN0SGx8VT86l/7lSoHtG3VHO9IsQCuNGF02SB+6uGpXBlptbEoQ4nTxd
 T7/H9FNOe2Wf7eKvcOOds8UlvZYAfYcY0GcRrIOXdHIy25mKFWwn5cDgFTMOH5ID
 xXpm+JFkDkrfSW1o4FFPxbN9Z6RbVXbGCsrXlIragLO2MJQdXiIUxS1OPT5oAado
 H9MlX6QtkwziLW9zUWa/N/jmRjc2vzHAxD6JFg/wXxNdtY0kd8TQpaxwTB8mVDPN
 VCGutt7lJS1CQInQ+ppzbdqzzuLHC1RHAyWSmfUE9rb8cbjxtJBnSIorYRLUesMT
 GRwqVTXW0osxSgCb1iDiBCJANrX1yPZcemv4Wh1gzbT6IE9sWxWXsE5sy9KvswNc
 M+E4nu/TYYTfkynItJjLgmDLOoi+V0FBY6ba0mRPBjkriSP4AVlwsZLGVsAHQzuA
 o5paW1GjRCCwhIQ6+AzZIoOz6wqvprBlUgUkUneyYAQ2ZKC3pZi8zPnpoVdFucVa
 VaTzP71C1Q==
 =efaq
 -----END PGP SIGNATURE-----

Merge tag 'for-5.18/write-streams-2022-03-18' of git://git.kernel.dk/linux-block

Pull NVMe write streams removal from Jens Axboe:
 "This removes the write streams support in NVMe. No vendor ever really
  shipped working support for this, and they are not interested in
  supporting it.

  With the NVMe support gone, we have nothing in the tree that supports
  this. Remove passing around of the hints.

  The only discussion point in this patchset imho is the fact that the
  file specific write hint setting/getting fcntl helpers will now return
  -1/EINVAL like they did before we supported write hints. No known
  applications use these functions, I only know of one prototype that I
  help do for RocksDB, and that's not used. That said, with a change
  like this, it's always a bit controversial. Alternatively, we could
  just make them return 0 and pretend it worked. It's placement based
  hints after all"

* tag 'for-5.18/write-streams-2022-03-18' of git://git.kernel.dk/linux-block:
  fs: remove fs.f_write_hint
  fs: remove kiocb.ki_hint
  block: remove the per-bio/request write hint
  nvme: remove support or stream based temperature hint
2022-03-26 11:51:46 -07:00
Linus Torvalds 6f2689a766 SCSI misc on 20220324
This series consists of the usual driver updates (qla2xxx, pm8001,
 libsas, smartpqi, scsi_debug, lpfc, iscsi, mpi3mr) plus minor updates
 and bug fixes.  The high blast radius core update is the removal of
 write same, which affects block and several non-SCSI devices.  The
 other big change, which is more local, is the removal of the SCSI
 pointer.
 
 Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com>
 -----BEGIN PGP SIGNATURE-----
 
 iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCYjzDQyYcamFtZXMuYm90
 dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishQMYAQDEWUGV
 6U0+736AHVtOfiMNfiRN79B1HfXVoHvemnPcTwD/UlndwFfy/3GGOtoZmqEpc73J
 Ec1HDuUCE18H1H2QAh0=
 =/Ty9
 -----END PGP SIGNATURE-----

Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI updates from James Bottomley:
 "This series consists of the usual driver updates (qla2xxx, pm8001,
  libsas, smartpqi, scsi_debug, lpfc, iscsi, mpi3mr) plus minor updates
  and bug fixes.

  The high blast radius core update is the removal of write same, which
  affects block and several non-SCSI devices. The other big change,
  which is more local, is the removal of the SCSI pointer"

* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (281 commits)
  scsi: scsi_ioctl: Drop needless assignment in sg_io()
  scsi: bsg: Drop needless assignment in scsi_bsg_sg_io_fn()
  scsi: lpfc: Copyright updates for 14.2.0.0 patches
  scsi: lpfc: Update lpfc version to 14.2.0.0
  scsi: lpfc: SLI path split: Refactor BSG paths
  scsi: lpfc: SLI path split: Refactor Abort paths
  scsi: lpfc: SLI path split: Refactor SCSI paths
  scsi: lpfc: SLI path split: Refactor CT paths
  scsi: lpfc: SLI path split: Refactor misc ELS paths
  scsi: lpfc: SLI path split: Refactor VMID paths
  scsi: lpfc: SLI path split: Refactor FDISC paths
  scsi: lpfc: SLI path split: Refactor LS_RJT paths
  scsi: lpfc: SLI path split: Refactor LS_ACC paths
  scsi: lpfc: SLI path split: Refactor the RSCN/SCR/RDF/EDC/FARPR paths
  scsi: lpfc: SLI path split: Refactor PLOGI/PRLI/ADISC/LOGO paths
  scsi: lpfc: SLI path split: Refactor base ELS paths and the FLOGI path
  scsi: lpfc: SLI path split: Introduce lpfc_prep_wqe
  scsi: lpfc: SLI path split: Refactor fast and slow paths to native SLI4
  scsi: lpfc: SLI path split: Refactor lpfc_iocbq
  scsi: lpfc: Use kcalloc()
  ...
2022-03-24 19:37:53 -07:00
Linus Torvalds 69d1dea852 for-5.18/drivers-2022-03-18
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmI0/QUQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpn8GEACRVxJaJV5qjZfoFAQKoAWJEtquwjeARyB+
 0V8ROWHDWHSacdug9wBytayiS1lz2zmUHJ6YXyts2dn0v6CrK4s8yGzk5G/RgH6+
 6M3GmBKjj+r1DfE8L3OoQWkDR1JFPuFxXTG/uBd7fBY2Excih1Z0D2lpspMleIRf
 w8zBrlWrWH8lZlm6HF3fadjEoiWhOM5F4Ofz3eg/PAQrHuD06z8hjQgMeR0jQVzw
 bWF9jrdNIplxRjNWIwCTsQRM+z5KQhUGwDODJjIwdQtVaKSt9D99ZbeKTudlslQ2
 zrizsCq8P1RjBPcrA45FV6QnT9DIRRGrYzHD63qC6fDae34rbzdSHUwRMP2XSxo8
 +hT1AzGypiBauODTPzHFtTskaQ0KibLznEanChh/ThySmNYcEVAljSx3Z5Vo81J+
 IqJYK2m3RESCFruy9w3U/P7qiXZmqYldPfjxAKq8ucg6x1PU3XRAVm7SI/i4l75D
 Crk1ujj2LJgsyxL6qMrK3XUavl1SJdzWeFSarcCt3m4m11EWWfYzmG8Yn8OE2CEZ
 a2CAyDsRi8CZ3hvkaMwigL4wBJjrrig8vyIgok3VrfCmYlNNqMQqM5Rw7vzjR3v1
 cKewI3rQjkFXEaveIXyGPTI/0Da4cT0DOfn/Mws9MDUXNPlFMNEDUZkPuzMywiTB
 2SWDLTe77g==
 =993h
 -----END PGP SIGNATURE-----

Merge tag 'for-5.18/drivers-2022-03-18' of git://git.kernel.dk/linux-block

Pull block driver updates from Jens Axboe:

 - NVMe updates via Christoph:
      - add vectored-io support for user-passthrough (Kanchan Joshi)
      - add verbose error logging (Alan Adamson)
      - support buffered I/O on block devices in nvmet (Chaitanya
        Kulkarni)
      - central discovery controller support (Martin Belanger)
      - fix and extended the globally unique idenfier validation
        (Christoph)
      - move away from the deprecated IDA APIs (Sagi Grimberg)
      - misc code cleanup (Keith Busch, Max Gurtovoy, Qinghua Jin,
        Chaitanya Kulkarni)
      - add lockdep annotations for in-kernel sockets (Chris Leech)
      - use vmalloc for ANA log buffer (Hannes Reinecke)
      - kerneldoc fixes (Chaitanya Kulkarni)
      - cleanups (Guoqing Jiang, Chaitanya Kulkarni, Christoph)
      - warn about shared namespaces without multipathing (Christoph)

 - MD updates via Song with a set of cleanups (Christoph, Mariusz, Paul,
   Erik, Dirk)

 - loop cleanups and queue depth configuration (Chaitanya)

 - null_blk cleanups and fixes (Chaitanya)

 - Use descriptive init/exit names in virtio_blk (Randy)

 - Use bvec_kmap_local() in drivers (Christoph)

 - bcache fixes (Mingzhe)

 - xen blk-front persistent grant speedups (Juergen)

 - rnbd fix and cleanup (Gioh)

 - Misc fixes (Christophe, Colin)

* tag 'for-5.18/drivers-2022-03-18' of git://git.kernel.dk/linux-block: (76 commits)
  virtio_blk: eliminate anonymous module_init & module_exit
  nvme: warn about shared namespaces without CONFIG_NVME_MULTIPATH
  nvme: remove nvme_alloc_request and nvme_alloc_request_qid
  nvme: cleanup how disk->disk_name is assigned
  nvmet: move the call to nvmet_ns_changed out of nvmet_ns_revalidate
  nvmet: use snprintf() with PAGE_SIZE in configfs
  nvmet: don't fold lines
  nvmet-rdma: fix kernel-doc warning for nvmet_rdma_device_removal
  nvmet-fc: fix kernel-doc warning for nvmet_fc_unregister_targetport
  nvmet-fc: fix kernel-doc warning for nvmet_fc_register_targetport
  nvme-tcp: lockdep: annotate in-kernel sockets
  nvme-tcp: don't fold the line
  nvme-tcp: don't initialize ret variable
  nvme-multipath: call bio_io_error in nvme_ns_head_submit_bio
  nvme-multipath: use vmalloc for ANA log buffer
  xen/blkfront: speed up purge_persistent_grants()
  raid5: initialize the stripe_head embeeded bios as needed
  raid5-cache: statically allocate the recovery ra bio
  raid5-cache: fully initialize flush_bio when needed
  raid5-ppl: fully initialize the bio in ppl_new_iounit
  ...
2022-03-21 17:16:01 -07:00
Mariusz Tkaczyk daae161fd2 md: raid1/raid10: drop pending_cnt
Those counters are not necessary after commit 11bb45e8aaf6 ("md: drop queue
limitation for RAID1 and RAID10"). Remove them from all code (conf and
plug structs). raid1_plug_cb and raid10_plug_cb are identical, so move
definition of raid1_plug_cb to common raid1-10 definitions and use it for
RAID10 too.

Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
Signed-off-by: Song Liu <song@kernel.org>
2022-03-08 15:16:54 -08:00
Christoph Hellwig c75e707fe1 block: remove the per-bio/request write hint
With the NVMe support for this gone, there are no consumers of these hints
left, so remove them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220304175556.407719-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-07 12:45:57 -07:00
Christoph Hellwig ac483eb375 raid1: stop using bio_devname
Use the %pg format specifier to save on stack consuption and code size.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Song Liu <song@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220304180105.409765-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-07 06:42:33 -07:00
Christoph Hellwig 10fa225c33 scsi: md: Remove WRITE_SAME support
There are no more end-users of REQ_OP_WRITE_SAME left, so we can start
deleting it.

Link: https://lore.kernel.org/r/20220209082828.2629273-6-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-02-22 21:11:08 -05:00
Christoph Hellwig abfc426d1b block: pass a block_device to bio_clone_fast
Pass a block_device to bio_clone_fast and __bio_clone_fast and give
the functions more suitable names.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20220202160109.108149-14-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-02-04 07:43:18 -07:00
Christoph Hellwig a7c50c9404 block: pass a block_device and opf to bio_reset
Pass the block_device that we plan to use this bio for and the
operation to bio_reset to optimize the assigment.  A NULL block_device
can be passed, both for the passthrough case on a raw request_queue and
to temporarily avoid refactoring some nasty code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220124091107.642561-20-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-02-02 07:50:00 -07:00