Commit Graph

43 Commits

Author SHA1 Message Date
Ming Lei 23e79f75e6 ublk_drv: remove nr_aborted_queues from ublk_device
[ Upstream commit ed878d1c1c ]

No one uses 'nr_aborted_queues' any more, so remove it.

Reviewed-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230106041711.914434-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Stable-dep-of: 73a166d974 ("ublk_drv: don't probe partitions if the ubq daemon isn't trusted")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-03-10 09:32:41 +01:00
Liu Xiaodong ee1e3fe4b4 block: ublk: extending queue_size to fix overflow
[ Upstream commit 29baef789c ]

When validating drafted SPDK ublk target, in a case that
assigning large queue depth to multiqueue ublk device,
ublk target would run into a weird incorrect state. During
rounds of review and debug, An overflow bug was found
in ublk driver.

In ublk_cmd.h, UBLK_MAX_QUEUE_DEPTH is 4096 which means
each ublk queue depth can be set as large as 4096. But
when setting qd for a ublk device,
sizeof(struct ublk_queue) + depth * sizeof(struct ublk_io)
will be larger than 65535 if qd is larger than 2728.
Then queue_size is overflowed, and ublk_get_queue()
references a wrong pointer position. The wrong content of
ublk_queue elements will lead to out-of-bounds memory
access.

Extend queue_size in ublk_device as "unsigned int".

Signed-off-by: Liu Xiaodong <xiaodong.liu@intel.com>
Fixes: 71f28f3136 ("ublk_drv: add io_uring based userspace block driver")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230131070552.115067-1-xiaodong.liu@intel.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-02-09 11:28:08 +01:00
Ming Lei fe10ce3041 block: ublk: move ublk_chr_class destroying after devices are removed
[ Upstream commit 8e4ff68476 ]

The 'ublk_chr_class' is needed when deleting ublk char devices in
ublk_exit(), so move it after devices(idle) are removed.

Fixes the following warning reported by Harris, James R:

[  859.178950] sysfs group 'power' not found for kobject 'ublkc0'
[  859.178962] WARNING: CPU: 3 PID: 1109 at fs/sysfs/group.c:278 sysfs_remove_group+0x9c/0xb0

Reported-by: "Harris, James R" <james.r.harris@intel.com>
Fixes: 71f28f3136 ("ublk_drv: add io_uring based userspace block driver")
Link: https://lore.kernel.org/linux-block/Y9JlFmSgDl3+zy3N@T590/T/#t
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Jim Harris <james.r.harris@intel.com>
Link: https://lore.kernel.org/r/20230126115346.263344-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-02-01 08:34:49 +01:00
Ming Lei 4e0c2961e5 ublk: honor IO_URING_F_NONBLOCK for handling control command
[ Upstream commit fa8e442e83 ]

Most of control command handlers may sleep, so return -EAGAIN in case
of IO_URING_F_NONBLOCK to defer the handling into io wq context.

Fixes: 71f28f3136 ("ublk_drv: add io_uring based userspace block driver")
Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230104133235.836536-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-01-12 12:02:32 +01:00
Ming Lei 7d4a93176e ublk_drv: don't forward io commands in reserve order
Either ublk_can_use_task_work() is true or not, io commands are
forwarded to ublk server in reverse order, since llist_add() is
always to add one element to the head of the list.

Even though block layer doesn't guarantee request dispatch order,
requests should be sent to hardware in the sequence order generated
from io scheduler, which usually considers the request's LBA, and
order is often important for HDD.

So forward io commands in the sequence made from io scheduler by
aligning task work with current io_uring command's batch handling,
and it has been observed that both can get similar performance data
if IORING_SETUP_COOP_TASKRUN is set from ublk server.

Reported-by: Andreas Hindborg <andreas.hindborg@wdc.com>
Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
Link: https://lore.kernel.org/r/20221121155645.396272-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-23 20:36:57 -07:00
Ming Lei fee32f3124 ublk_drv: add ublk_queue_cmd() for cleanup
Add helper of ublk_queue_cmd() so that both ublk_queue_rq()
and ublk_handle_need_get_data() can reuse this helper.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
Link: https://lore.kernel.org/r/20221029010432.598367-5-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-31 07:23:24 -06:00
Ming Lei 3ab6e94ca5 ublk_drv: avoid to touch io_uring cmd in blk_mq io path
io_uring cmd is supposed to be used in ubq daemon context mainly,
and we should try to avoid to touch it in ublk io submission context,
otherwise this data could become shared between the two contexts,
and performance is hurt.

So link request into one per-queue list, and use same batching policy
of io_uring command, just avoid to touch ucmd in blk-mq io context.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
Link: https://lore.kernel.org/r/20221029010432.598367-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-31 07:23:24 -06:00
Ming Lei 224e858f21 ublk_drv: return flag of UBLK_F_URING_CMD_COMP_IN_TASK in case of module
UBLK_F_URING_CMD_COMP_IN_TASK needs to be set and returned to userspace
if ublk driver is built as module, otherwise userspace may get wrong
flags shown.

Fixes: 71f28f3136 ("ublk_drv: add io_uring based userspace block driver")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
Link: https://lore.kernel.org/r/20221029010432.598367-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-31 07:23:16 -06:00
Yushan Zhou 72495b5ab4 ublk_drv: use flexible-array member instead of zero-length array
Eliminate the following coccicheck warning:
./drivers/block/ublk_drv.c:127:16-19: WARNING use flexible-array member instead

Signed-off-by: Yushan Zhou <katrinzhou@tencent.com>
Link: https://lore.kernel.org/r/20221018100132.355393-1-zys.zljxml@gmail.com
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-19 18:27:23 -07:00
ZiyangZhang c732a852b4 ublk_drv: add START_USER_RECOVERY and END_USER_RECOVERY support
START_USER_RECOVERY and END_USER_RECOVERY are two new control commands
to support user recovery feature.

After a crash, user should send START_USER_RECOVERY, it will:
(1) check if (a)current ublk_device is UBLK_S_DEV_QUIESCED which was
    set by quiesce_work and (b)chardev is released
(2) reinit all ubqs, including:
    (a) put the task_struct and reset ->ubq_daemon to NULL.
    (b) reset all ublk_io.
(3) reset ub->mm to NULL.

Then, user should start a new process and send FETCH_REQ on each
ubq_daemon.

Finally, user should send END_USER_RECOVERY, it will:
(1) wait for all new ubq_daemons getting ready.
(2) update ublksrv_pid
(3) unquiesce the request queue and expect incoming ublk_queue_rq()
(4) convert ub's state to UBLK_S_DEV_LIVE

Note: we can handle STOP_DEV between START_USER_RECOVERY and
END_USER_RECOVERY. This is helpful to users who cannot start new process
after sending START_USER_RECOVERY ctrl-cmd.

Signed-off-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220923153919.44078-7-ZiyangZhang@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-23 19:09:56 -06:00
ZiyangZhang a0d41dc113 ublk_drv: support UBLK_F_USER_RECOVERY_REISSUE
UBLK_F_USER_RECOVERY_REISSUE implies that:
With a dying ubq_daemon, ublk_drv let monitor_work requeues rq issued to
userspace(ublksrv) before the ubq_daemon is dying.

UBLK_F_USER_RECOVERY_REISSUE is designed for backends which:
(1) tolerate double-write since ublk_drv may issue the same rq
    twice.
(2) does not let frontend users get I/O error, such as read-only FS
    and VM backend.

Signed-off-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220923153919.44078-6-ZiyangZhang@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-23 19:09:56 -06:00
ZiyangZhang bbae8d1f52 ublk_drv: consider recovery feature in aborting mechanism
With USER_RECOVERY feature enabled, the monitor_work schedules
quiesce_work after finding a dying ubq_daemon. The monitor_work
should also abort all rqs issued to userspace before the ubq_daemon is
dying. The quiesce_work's job is to:
(1) quiesce request queue.
(2) check if there is any INFLIGHT rq. If so, we retry until all these
    rqs are requeued and become IDLE. These rqs should be requeued by
	ublk_queue_rq(), task work, io_uring fallback wq or monitor_work.
(3) complete all ioucmds by calling io_uring_cmd_done(). We are safe to
    do so because no ioucmd can be referenced now.
(5) set ub's state to UBLK_S_DEV_QUIESCED, which means we are ready for
    recovery. This state is exposed to userspace by GET_DEV_INFO.

The driver can always handle STOP_DEV and cleanup everything no matter
ub's state is LIVE or QUIESCED. After ub's state is UBLK_S_DEV_QUIESCED,
user can recover with new process.

Note: we do not change the default behavior with reocvery feature
disabled. monitor_work still schedules stop_work and abort inflight
rqs. And finally ublk_device is released.

Signed-off-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220923153919.44078-5-ZiyangZhang@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-23 19:09:56 -06:00
ZiyangZhang 42cf5fc5ee ublk_drv: requeue rqs with recovery feature enabled
With recovery feature enabled, in ublk_queue_rq or task work
(in exit_task_work or fallback wq), we requeue rqs instead of
ending(aborting) them. Besides, No matter recovery feature is enabled
or disabled, we schedule monitor_work immediately.

Signed-off-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220923153919.44078-4-ZiyangZhang@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-23 19:09:56 -06:00
ZiyangZhang 77a440e2cb ublk_drv: define macros for recovery feature and check them
Define some macros for recovery feature.

UBLK_S_DEV_QUIESCED implies that ublk_device is quiesced
and is ready for recovery. This state can be observed by userspace.

UBLK_F_USER_RECOVERY implies that:
(1) ublk_drv enables recovery feature. It won't let monitor_work to
    automatically abort rqs and release the device.
(2) With a dying ubq_daemon, ublk_drv ends(aborts) rqs issued to
    userspace(ublksrv) before crash.
(3) With a dying ubq_daemon, in task work and ublk_queue_rq(),
    ublk_drv requeues rqs.

Signed-off-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220923153919.44078-3-ZiyangZhang@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-23 19:09:56 -06:00
ZiyangZhang ae3f719300 ublk_drv: check 'current' instead of 'ubq_daemon'
This check is not atomic. So with recovery feature, ubq_daemon may be
modified simultaneously by recovery task. Instead, check 'current' is
safe here because 'current' never changes.

Also add comment explaining this check, which is really important for
understanding recovery feature.

Signed-off-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220923153919.44078-2-ZiyangZhang@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-23 19:09:56 -06:00
ZiyangZhang e6190dd003 ublk_drv: do not add a re-issued request aborted previously to ioucmd's task_work
In ublk_queue_rq(), Assume current request is a re-issued request aborted
previously in monitor_work because the ubq_daemon(ioucmd's task) is
PF_EXITING. For this request, we cannot call
io_uring_cmd_complete_in_task() anymore because at that moment io_uring
context may be freed in case that no inflight ioucmd exists. Otherwise,
we may cause null-deref in ctx->fallback_work.

Add a check on UBLK_IO_FLAG_ABORTED to prevent the above situation. This
check is safe and makes sense.

Note: monitor_work sets UBLK_IO_FLAG_ABORTED and ends this request
(releasing the tag). Then the request is restarted(allocating the tag)
and we are here. Since releasing/allocating a tag implies smp_mb(),
finding UBLK_IO_FLAG_ABORTED guarantees that here is a re-issued request
aborted previously.

Suggested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220815023633.259825-4-ZiyangZhang@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-16 06:16:19 -06:00
ZiyangZhang bb24174754 ublk_drv: update comment for __ublk_fail_req()
Since __ublk_rq_task_work always fails requests immediately during
exiting, __ublk_fail_req() is only called from abort context during
exiting. So lock is unnecessary.

Signed-off-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220815023633.259825-3-ZiyangZhang@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-16 06:16:19 -06:00
ZiyangZhang 966120b51a ublk_drv: check ubq_daemon_is_dying() in __ublk_rq_task_work()
Replace direct check on PF_EXITING in __ublk_rq_task_work() by the
existing wrapper. Also inline ubq_daemon_is_dying().

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220815023633.259825-2-ZiyangZhang@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-16 06:16:19 -06:00
ZiyangZhang 92cb6e2e5d ublk_drv: update iod->addr for UBLK_IO_NEED_GET_DATA
If ublksrv sends UBLK_IO_NEED_GET_DATA with new allocated io buffer, we
have to update iod->addr in task_work before calling io_uring_cmd_done().
Then usersapce target can handle (write)io request with the new io
buffer reading from updated iod.

Without this change, userspace target may touch a wrong io buffer!

Signed-off-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220810055212.66417-1-ZiyangZhang@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-13 08:35:28 -06:00
ZiyangZhang c86019ff75 ublk_drv: add support for UBLK_IO_NEED_GET_DATA
UBLK_IO_NEED_GET_DATA is one ublk IO command. It is designed for a user
application who wants to allocate IO buffer and set IO buffer address
only after it receives an IO request from ublksrv. This is a reasonable
scenario because these users may use a RPC framework as one IO backend
to handle IO requests passed from ublksrv. And a RPC framework may
allocate its own buffer(or memory pool).

This new feature (UBLK_F_NEED_GET_DATA) is optional for ublk users.
Related userspace code has been added in ublksrv[1] as one pull request.

Test cases for this feature are added in ublksrv and all the tests pass.
The performance result shows that this new feature does bring additional
latency because one IO is issued back to ublk_drv once again to copy data
from bio vectors to user-provided data buffer. UBLK_IO_NEED_GET_DATA is
suitable for bigger block size such as 512B or 1MB.

[1] https://github.com/ming1/ubdsrv

Signed-off-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/3a21007ea1be8304246e654cebbd581ab0012623.1659011443.git.ZiyangZhang@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02 21:13:40 -06:00
Ming Lei 4bf9cbf3e9 ublk_drv: cleanup ublksrv_ctrl_dev_info
Remove all block device related info from ublksrv_ctrl_dev_info,
meantime reduce its size into 64 bytes because:

1) ublksrv_ctrl_dev_info becomes cleaner without including any
block related info

2) generic set/get parameter command can be used to set block
related setting easily and cleanly

3) generic set/get parameter command can be used for extending
ublk without needing more info in ublksrv_ctrl_dev_info

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220730092750.1118167-5-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02 21:13:40 -06:00
Ming Lei 0aa73170eb ublk_drv: add SET_PARAMS/GET_PARAMS control command
Add two commands to set/get parameters generically.

One important goal of ublk is to provide generic framework for making
block device by userspace flexibly.

As one generic block device, there are still lots of block parameters,
such as max_sectors, write_cache/fua, discard related limits,
zoned parameters, ...., so this patch starts to add generic mechanism
for set/get device parameters.

Both generic block parameters(all kinds of queue settings) and ublk
feature parameters can be covered with this way, then it becomes quite
easy to extend in future.

Add two parameter types are used so far: basic(covers basic queue setting
and misc settings which can't be grouped easily) and discard, basic type
must be set, and discard type becomes optional now

This way provides mechanism to simulate any kind of generic block device
from userspace easily, from both block queue setting viewpoint or ublk
feature viewpoint.

The style of putting all parameters together is suggested by Christoph.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220730092750.1118167-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02 21:13:40 -06:00
Ming Lei 93d71ec89d ublk_drv: fix ublk device leak in case that add_disk fails
->free_disk is only called after disk is added successfully, so
drop ublk device reference in case of add_disk() failure.

Fixes: 6d9e6dfdf3 ("ublk: defer disk allocation")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220730092750.1118167-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02 21:13:40 -06:00
Ming Lei a8ce5f52ef ublk_drv: cancel device even though disk isn't up
Each ublk queue is started before adding disk, we have to cancel queues in
ublk_stop_dev() so that ubq daemon can be exited, otherwise DEL_DEV command
may hang forever.

Also avoid to cancel queues two times by checking if queue is ready,
otherwise use-after-free on io_uring may be triggered because ublk_stop_dev
is called by ublk_remove() too.

Fixes: 71f28f3136 ("ublk_drv: add io_uring based userspace block driver")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220730092750.1118167-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02 21:13:40 -06:00
Dan Carpenter 8d9fdb6011 ublk_drv: fix double shift bug
The test/clear_bit() functions take a bit number, but this code is
passing as shifted value.  It's the equivalent of saying BIT(BIT(0))
instead of just BIT(0).

This doesn't affect runtime because numbers are small and it's done
consistently.

Fixes: fa36204556 ("ublk: simplify ublk_ch_open and ublk_ch_release")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/Yt/2R/+MJf/MSoyl@kili
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-26 12:30:07 -06:00
Ming Lei 6d8c5afc9a ublk_drv: make sure that correct flags(features) returned to userspace
Userspace may support more features or new added flags, but the driver
side can be old, so make sure correct flags(features) returned to
userpsace, then userspace can work as expected.

Also mark the 2nd flags as reversed, just use the 1st one. When we run
out of flags, the reserved one can be handled at that time.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220722103817.631258-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-22 09:22:53 -06:00
Christoph Hellwig fa9482e0b2 ublk_drv: fix error handling of ublk_add_dev
__ublk_destroy_dev() is called for handling error in ublk_add_dev(),
but either tagset isn't allocated or mutex isn't initialized.

So fix the issue by letting replacing ublk_add_dev with a
ublk_add_tag_set function that is much more limited in scope and
instead unwind every single step directly in ublk_ctrl_add_dev.
To allow for this refactor the device freeing so that there is
a helper for freeing the device number instead of coupling that
with freeing the mutex and the memory.

Note that this now copies the dev_info to userspace before adding
the character device.  This not only simplifies the erro handling
in ublk_ctrl_add_dev, but also means that the character device
can only be seen by userspace if the device addition succeeded.

Based on a patch from Ming Lei.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220722103817.631258-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-22 09:22:53 -06:00
Ming Lei e94eb459d3 ublk_drv: fix lockdep warning
ub->mutex is used to protecting reading and writing ub->mm, then the
following lockdep warning is triggered.

Fix it by using one dedicated spin lock for protecting ub->mm.

[1] lockdep warning
[   25.046186] ======================================================
[   25.048886] WARNING: possible circular locking dependency detected
[   25.051610] 5.19.0-rc4_for-v5.20+ #149 Not tainted
[   25.053665] ------------------------------------------------------
[   25.056334] ublk/989 is trying to acquire lock:
[   25.058296] ffff975d0329a918 (&disk->open_mutex){+.+.}-{3:3}, at: bd_register_pending_holders+0x2a/0x110
[   25.063678]
[   25.063678] but task is already holding lock:
[   25.066246] ffff975d1df59708 (&ub->mutex){+.+.}-{3:3}, at: ublk_ctrl_uring_cmd+0x2df/0x730
[   25.069423]
[   25.069423] which lock already depends on the new lock.
[   25.069423]
[   25.072603]
[   25.072603] the existing dependency chain (in reverse order) is:
[   25.074908]
[   25.074908] -> #3 (&ub->mutex){+.+.}-{3:3}:
[   25.076386]        __mutex_lock+0x93/0x870
[   25.077470]        ublk_ch_mmap+0x3a/0x140
[   25.078494]        mmap_region+0x375/0x5a0
[   25.079386]        do_mmap+0x33a/0x530
[   25.080168]        vm_mmap_pgoff+0xb9/0x150
[   25.080979]        ksys_mmap_pgoff+0x184/0x1f0
[   25.081838]        do_syscall_64+0x37/0x80
[   25.082653]        entry_SYSCALL_64_after_hwframe+0x46/0xb0
[   25.083730]
[   25.083730] -> #2 (&mm->mmap_lock#2){++++}-{3:3}:
[   25.084707]        __might_fault+0x55/0x80
[   25.085344]        _copy_from_user+0x1e/0xa0
[   25.086020]        get_sg_io_hdr+0x26/0xb0
[   25.086651]        scsi_ioctl+0x42f/0x960
[   25.087267]        sr_block_ioctl+0xe8/0x100
[   25.087734]        blkdev_ioctl+0x134/0x2b0
[   25.088196]        __x64_sys_ioctl+0x8a/0xc0
[   25.088677]        do_syscall_64+0x37/0x80
[   25.089044]        entry_SYSCALL_64_after_hwframe+0x46/0xb0
[   25.089548]
[   25.089548] -> #1 (&cd->lock){+.+.}-{3:3}:
[   25.090072]        __mutex_lock+0x93/0x870
[   25.090452]        sr_block_open+0x64/0xe0
[   25.090837]        blkdev_get_whole+0x26/0x90
[   25.091445]        blkdev_get_by_dev.part.0+0x1ce/0x2f0
[   25.092203]        blkdev_open+0x52/0x90
[   25.092617]        do_dentry_open+0x1ca/0x360
[   25.093499]        path_openat+0x78d/0xcb0
[   25.094136]        do_filp_open+0xa1/0x130
[   25.094759]        do_sys_openat2+0x76/0x130
[   25.095454]        __x64_sys_openat+0x5c/0x70
[   25.096078]        do_syscall_64+0x37/0x80
[   25.096637]        entry_SYSCALL_64_after_hwframe+0x46/0xb0
[   25.097304]
[   25.097304] -> #0 (&disk->open_mutex){+.+.}-{3:3}:
[   25.098229]        __lock_acquire+0x12e2/0x1f90
[   25.098789]        lock_acquire+0xbf/0x2c0
[   25.099256]        __mutex_lock+0x93/0x870
[   25.099706]        bd_register_pending_holders+0x2a/0x110
[   25.100246]        device_add_disk+0x209/0x370
[   25.100712]        ublk_ctrl_uring_cmd+0x405/0x730
[   25.101205]        io_issue_sqe+0xfe/0x2ac0
[   25.101665]        io_submit_sqes+0x352/0x1820
[   25.102131]        __do_sys_io_uring_enter+0x848/0xdc0
[   25.102646]        do_syscall_64+0x37/0x80
[   25.103087]        entry_SYSCALL_64_after_hwframe+0x46/0xb0
[   25.103640]
[   25.103640] other info that might help us debug this:
[   25.103640]
[   25.104549] Chain exists of:
[   25.104549]   &disk->open_mutex --> &mm->mmap_lock#2 --> &ub->mutex
[   25.104549]
[   25.105611]  Possible unsafe locking scenario:
[   25.105611]
[   25.106258]        CPU0                    CPU1
[   25.106677]        ----                    ----
[   25.107100]   lock(&ub->mutex);
[   25.107446]                                lock(&mm->mmap_lock#2);
[   25.108045]                                lock(&ub->mutex);
[   25.108802]   lock(&disk->open_mutex);
[   25.109265]
[   25.109265]  *** DEADLOCK ***
[   25.109265]
[   25.110117] 2 locks held by ublk/989:
[   25.110490]  #0: ffff975d07bbf8a8 (&ctx->uring_lock){+.+.}-{3:3}, at: __do_sys_io_uring_enter+0x83e/0xdc0
[   25.111249]  #1: ffff975d1df59708 (&ub->mutex){+.+.}-{3:3}, at: ublk_ctrl_uring_cmd+0x2df/0x730
[   25.111943]
[   25.111943] stack backtrace:
[   25.112557] CPU: 2 PID: 989 Comm: ublk Not tainted 5.19.0-rc4_for-v5.20+ #149
[   25.113137] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-1.fc33 04/01/2014
[   25.113792] Call Trace:
[   25.114130]  <TASK>
[   25.114417]  dump_stack_lvl+0x71/0xa0
[   25.114771]  check_noncircular+0xdf/0x100
[   25.115137]  ? register_lock_class+0x38/0x470
[   25.115524]  __lock_acquire+0x12e2/0x1f90
[   25.115887]  ? find_held_lock+0x2b/0x80
[   25.116244]  lock_acquire+0xbf/0x2c0
[   25.116590]  ? bd_register_pending_holders+0x2a/0x110
[   25.117009]  __mutex_lock+0x93/0x870
[   25.117362]  ? bd_register_pending_holders+0x2a/0x110
[   25.117780]  ? bd_register_pending_holders+0x2a/0x110
[   25.118201]  ? kobject_add+0x71/0x90
[   25.118546]  ? bd_register_pending_holders+0x2a/0x110
[   25.118958]  bd_register_pending_holders+0x2a/0x110
[   25.119373]  device_add_disk+0x209/0x370
[   25.119732]  ublk_ctrl_uring_cmd+0x405/0x730
[   25.120109]  ? rcu_read_lock_sched_held+0x3c/0x70
[   25.120514]  io_issue_sqe+0xfe/0x2ac0
[   25.120863]  io_submit_sqes+0x352/0x1820
[   25.121228]  ? rcu_read_lock_sched_held+0x3c/0x70
[   25.121626]  ? __do_sys_io_uring_enter+0x83e/0xdc0
[   25.122028]  ? find_held_lock+0x2b/0x80
[   25.122390]  ? __do_sys_io_uring_enter+0x848/0xdc0
[   25.122791]  __do_sys_io_uring_enter+0x848/0xdc0
[   25.123190]  ? syscall_enter_from_user_mode+0x20/0x70
[   25.123606]  ? syscall_enter_from_user_mode+0x20/0x70
[   25.124024]  do_syscall_64+0x37/0x80
[   25.124383]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
[   25.124829] RIP: 0033:0x7f120a762af6
[   25.125223] Code: 45 c1 41 89 c2 41 b9 08 00 00 00 41 83 ca 10 f6 87 d0 00 00 00 01 8b bf cc 00 00 00 44 0f 44 d0 45 31 c0c
[   25.126576] RSP: 002b:00007ffdcb3c5518 EFLAGS: 00000246 ORIG_RAX: 00000000000001aa
[   25.127153] RAX: ffffffffffffffda RBX: 00000000013aef50 RCX: 00007f120a762af6
[   25.127748] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000004
[   25.128351] RBP: 000000000000000b R08: 0000000000000000 R09: 0000000000000008
[   25.128956] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffdcb3c74a6
[   25.129524] R13: 00000000013aef50 R14: 0000000000000000 R15: 00000000000003df
[   25.130121]  </TASK>

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220721153117.591394-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-21 13:14:44 -06:00
Christoph Hellwig 6d9e6dfdf3 ublk: defer disk allocation
Defer allocating the gendisk and request_queue until UBLK_CMD_START_DEV
is called.  This avoids funky life times where a disk is allocated
and then can be added and removed multiple times, which has never been
supported by the block layer.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220721130916.1869719-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-21 10:52:12 -06:00
Christoph Hellwig c50061f0f1 ublk: rewrite ublk_ctrl_get_queue_affinity to not rely on hctx->cpumask
Looking at the hctxs and cpumap is not safe without at very last a RCU
reference.  It also requires the queue to be set up before starting the
device, which leads to rather awkward life time rules.

Instead rewrite ublk_ctrl_get_queue_affinity to just build the cpumask
directly from the mq_map in the tag set, similar to hctx->cpumask is
built.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220721130916.1869719-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-21 10:52:12 -06:00
Christoph Hellwig cfee7e4de2 ublk: fold __ublk_create_dev into ublk_ctrl_add_dev
Fold __ublk_create_dev into its only caller to avoid the packing and
unpacking of the return value into an ERR_PTR.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220721130916.1869719-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-21 10:52:12 -06:00
Christoph Hellwig 34d8f2bea5 ublk: cleanup ublk_ctrl_uring_cmd
Move all per-command work into the per-command ublk_ctrl_* helpers
instead of being split over those, ublk_ctrl_cmd_validate, and the main
ublk_ctrl_uring_cmd handler.  To facilitate that, the old
ublk_ctrl_stop_dev function that just contained two function calls is
folded into both callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220721130916.1869719-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-21 10:52:12 -06:00
Christoph Hellwig fa36204556 ublk: simplify ublk_ch_open and ublk_ch_release
fops->open and fops->release are always paired.  Use simple atomic bit
ops ot indicate if the device is opened instead of a count that can
only be 0 and 1 and a useless cmpxchg loop in ublk_ch_release.

Also don't bother clearing file->private_data is the file is about to
be freed anyway.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220721130916.1869719-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-21 10:52:12 -06:00
Christoph Hellwig 49d686ccee ublk: remove the empty open and release block device operations
No need to define empty versions, they can just be left out.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220721130916.1869719-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-21 10:52:12 -06:00
Christoph Hellwig 5f8bcc837a ublk: remove UBLK_IO_F_PREFLUSH
REQ_PREFLUSH is turned into REQ_OP_FLUSH by the flush state machine
and thus never seen by a blk-mq based driver.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220721130916.1869719-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-21 10:52:12 -06:00
Dan Carpenter fe3333f695 ublk_drv: fix an IS_ERR() vs NULL check
The blk_mq_alloc_disk_for_queue() doesn't return error pointers, it
returns NULL on error.

Fixes: cebbe577cb ("ublk_drv: fix request queue leak")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/YtVAgedTsQVK1oTM@kili
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-18 13:11:11 -06:00
Christoph Hellwig d276a22314 ublk: remove UBLK_IO_F_INTEGRITY
The ublk protocol has no mechanism to actually transfer the integrity
metadata, so don't define this flag, which requires that an integrity
payload is attached to a bio.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220718063013.335531-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-18 13:10:47 -06:00
Yang Li 6b1439d203 ublk_drv: remove unneeded semicolon
Eliminate the following coccicheck warnings:
./drivers/block/ublk_drv.c:1467:2-3: Unneeded semicolon
./drivers/block/ublk_drv.c:1528:2-3: Unneeded semicolon

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220718015431.40185-1-yang.lee@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-18 13:10:24 -06:00
Yang Yingliang f50e5d670c ublk_drv: fix missing error return code in ublk_add_dev()
If blk_mq_init_queue() fails, it should return error code in ublk_add_dev()

Fixes: cebbe577cb ("ublk_drv: fix request queue leak")
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220718042408.3132835-1-yangyingliang@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-18 13:10:22 -06:00
Ming Lei f2450f8a2c ublk_drv: fix build warning with -Wmaybe-uninitialized and one sparse warning
After applying -Wmaybe-uninitialized manually, two build warnings are
triggered:

drivers/block/ublk_drv.c:940:11: warning: ‘io’ may be used uninitialized [-Wmaybe-uninitialized]
  940 |         io->flags &= ~UBLK_IO_FLAG_ACTIVE;

drivers/block/ublk_drv.c: In function ‘ublk_ctrl_uring_cmd’:
drivers/block/ublk_drv.c:1531:9: warning: ‘ret’ may be used uninitialized [-Wmaybe-uninitialized]

Fix the 1st one by removing 'io->flags &= ~UBLK_IO_FLAG_ACTIVE;' which
isn't needed since the function always return successfully after setting
this flag.

Fix the 2nd one by always initializing 'ret'.

Also fix another sparse warning of 'sparse: sparse: incorrect type in return
expression' by changing return type of ublk_setup_iod().

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220716095344.222674-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-16 06:32:48 -06:00
Ming Lei cebbe577cb ublk_drv: fix request queue leak
Call blk_cleanup_queue() in release code path for fixing request
queue leak.

Also for-5.20/block has cleaned up blk_cleanup_queue(), which is
basically merged to del_gendisk() if blk_mq_alloc_disk() is used
for allocating disk and queue.

However, ublk may not add disk in case of starting device failure, then
del_gendisk() won't be called when removing ublk device, so blk_mq_exit_queue
will not be callsed, and it can be bit hard to deal with this kind of
merge conflict.

Turns out ublk's queue/disk use model is very similar with scsi, so switch
to scsi's model by allocating disk and queue independently, then it can be
quite easy to handle v5.20 merge conflict by replacing blk_cleanup_queue
with blk_mq_destroy_queue.

Reported-by: Jens Axboe <axboe@kernel.dk>
Fixes: 71f28f3136 ("ublk_drv: add io_uring based userspace block driver")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220714103201.131648-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-14 07:16:04 -06:00
Ming Lei 0edb3696c1 ublk_drv: support to complete io command via task_work_add
Use task_work_add if it is available, since task_work_add can bring
up better performance, especially batching signaling ->ubq_daemon can
be done.

It is observed that task_work_add() can boost iops by +4% on random
4k io test. Also except for completing io command, all other code
paths are same with completing io command via
io_uring_cmd_complete_in_task.

Meantime add one flag of UBLK_F_URING_CMD_COMP_IN_TASK for comparing
the mode easily.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220713140711.97356-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-14 07:15:48 -06:00
Ming Lei 71f28f3136 ublk_drv: add io_uring based userspace block driver
This is the driver part of userspace block driver(ublk driver), the other
part is userspace daemon part(ublksrv)[1].

The two parts communicate by io_uring's IORING_OP_URING_CMD with one
shared cmd buffer for storing io command, and the buffer is read only for
ublksrv, each io command is indexed by io request tag directly, and is
written by ublk driver.

For example, when one READ io request is submitted to ublk block driver,
ublk driver stores the io command into cmd buffer first, then completes
one IORING_OP_URING_CMD for notifying ublksrv, and the URING_CMD is issued
to ublk driver beforehand by ublksrv for getting notification of any new
io request, and each URING_CMD is associated with one io request by tag.

After ublksrv gets the io command, it translates and handles the ublk io
request, such as, for the ublk-loop target, ublksrv translates the request
into same request on another file or disk, like the kernel loop block
driver. In ublksrv's implementation, the io is still handled by io_uring,
and share same ring with IORING_OP_URING_CMD command. When the target io
request is done, the same IORING_OP_URING_CMD is issued to ublk driver for
both committing io request result and getting future notification of new
io request.

Another thing done by ublk driver is to copy data between kernel io
request and ublksrv's io buffer:

1) before ubsrv handles WRITE request, copy the request's data into
   ublksrv's userspace io buffer, so that ublksrv can handle the write
   request

2) after ubsrv handles READ request, copy ublksrv's userspace io buffer
   into this READ request, then ublk driver can complete the READ request

Zero copy may be switched if mm is ready to support it.

ublk driver doesn't handle any logic of the specific user space driver,
so it is small/simple enough.

[1] ublksrv

https://github.com/ming1/ubdsrv

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220713140711.97356-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-14 07:15:28 -06:00