Commit graph

277 commits

Author SHA1 Message Date
Breno Leitao
c1755c25a7 io_uring: Enable KASAN for request cache
Every io_uring request is represented by struct io_kiocb, which is
cached locally by io_uring (not SLAB/SLUB) in the list called
submit_state.freelist. This patch simply enabled KASAN for this free
list.

This list is initially created by KMEM_CACHE, but later, managed by
io_uring. This patch basically poisons the objects that are not used
(i.e., they are the free list), and unpoisons it when the object is
allocated/removed from the list.

Touching these poisoned objects while in the freelist will cause a KASAN
warning.

Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:41 -07:00
Breno Leitao
f30bd4d038 io_uring: Split io_issue_def struct
This patch removes some "cold" fields from `struct io_issue_def`.

The plan is to keep only highly used fields into `struct io_issue_def`, so,
it may be hot in the cache. The hot fields are basically all the bitfields
and the callback functions for .issue and .prep.

The other less frequently used fields are now located in a secondary and
cold struct, called `io_cold_def`.

This is the size for the structs:

Before: io_issue_def = 56 bytes
After: io_issue_def = 24 bytes; io_cold_def = 40 bytes

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20230112144411.2624698-2-leitao@debian.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:41 -07:00
Breno Leitao
a7dd27828b io_uring: Rename struct io_op_def
The current io_op_def struct is becoming huge and the name is a bit
generic.

The goal of this patch is to rename this struct to `io_issue_def`. This
struct will contain the hot functions associated with the issue code
path.

For now, this patch only renames the structure, and an upcoming patch
will break up the structure in two, moving the non-issue fields to a
secondary struct.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20230112144411.2624698-1-leitao@debian.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:41 -07:00
Pavel Begunkov
68a2cc1bba io_uring: refactor __io_req_complete_post
Keep parts of __io_req_complete_post() relying on req->flags together so
the value can be cached.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2b4fbb42f404a0e75c4d9f0a5b16f314a839d0a9.1673887636.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:41 -07:00
Pavel Begunkov
31f084b7b0 io_uring: simplify fallback execution
Lock the ring with uring_lock in io_fallback_req_func(), which should
make it a bit safer and easier. With that we also don't need refs
pinning as io_ring_exit_work() will wait until uring_lock is freed.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/56170e6a0cbfc8edee2794c6613e8f6f1d76d276.1673887636.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:41 -07:00
Pavel Begunkov
89800a2dd5 io_uring: don't export io_put_task()
io_put_task() is only used in uring.c so enclose it there together with
__io_put_task().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/43c7f9227e2ab215f1a6069dadbc5382bed346fe.1673887636.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:41 -07:00
Pavel Begunkov
b0b7a7d24b io_uring: return back links tw run optimisation
io_submit_flush_completions() may queue new requests for tw execution,
especially true for linked requests. Recheck the tw list for emptiness
after flushing completions.

Note that this doesn't really fix the commit referenced below, but it
does reinstate an optimization that existed before that got merged.

Fixes: f88262e60b ("io_uring: lockless task list")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6328acdbb5e60efc762b18003382de077e6e1367.1673887636.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:41 -07:00
Quanfa Fu
88b80534f6 io_uring: make io_sqpoll_wait_sq return void
Change the return type to void since it always return 0, and no need
to do the checking in syscall io_uring_enter.

Signed-off-by: Quanfa Fu <quanfafu@gmail.com>
Link: https://lore.kernel.org/r/20230115071519.554282-1-quanfafu@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:40 -07:00
Pavel Begunkov
c3f4d39ee4 io_uring: optimise deferred tw execution
We needed fake nodes in __io_run_local_work() and to avoid unecessary wake
ups while the task already running task_works, but we don't need them
anymore since wake ups are protected by cq_waiting, which is always
cleared by the time we're executing deferred task_work items.

Note that because of loose sync around cq_waiting clearing
io_req_local_work_add() may wake the task more than once, but that's
fine and should be rare to not hurt perf.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8839534891f0a2f1076e78554a31ea7e099f7de5.1673274244.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:40 -07:00
Pavel Begunkov
d80c0f00d0 io_uring: add io_req_local_work_add wake fast path
Don't wake the master task after queueing a deferred tw unless it's
currently waiting in io_cqring_wait.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/717702d772825a6647e6c315b4690277ba84c3fc.1673274244.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:40 -07:00
Pavel Begunkov
130bd686d9 io_uring: waitqueue-less cq waiting
With DEFER_TASKRUN only ctx->submitter_task might be waiting for CQEs,
we can use this to optimise io_cqring_wait(). Replace ->cq_wait
waitqueue with waking the task directly.

It works but misses an important optimisation covered by the following
patch, so this patch without follow ups might hurt performance.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/103d174d35d919d4cb0922d8a9c93a8f0c35f74a.1673274244.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:40 -07:00
Pavel Begunkov
3181e22fb7 io_uring: wake up optimisations
Flush completions is done either from the submit syscall or by the
task_work, both are in the context of the submitter task, and when it
goes for a single threaded rings like implied by ->task_complete, there
won't be any waiters on ->cq_wait but the master task. That means that
there can be no tasks sleeping on cq_wait while we run
__io_submit_flush_completions() and so waking up can be skipped.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/60ad9768ec74435a0ddaa6eec0ffa7729474f69f.1673274244.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:40 -07:00
Pavel Begunkov
bca39f3905 io_uring: add lazy poll_wq activation
Even though io_poll_wq_wake()'s waitqueue_active reuses a barrier we do
for another waitqueue, it's not going to be the case in the future and
so we want to have a fast path for it when the ring has never been
polled.

Move poll_wq wake ups into __io_commit_cqring_flush() using a new flag
called ->poll_activated. The idea behind the flag is to set it when the
ring was polled for the first time. This requires additional sync to not
miss events, which is done here by using task_work for ->task_complete
rings, and by default enabling the flag for all other types of rings.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/060785e8e9137a920b232c0c7f575b131af19cac.1673274244.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:40 -07:00
Pavel Begunkov
7b235dd82a io_uring: separate wq for ring polling
Don't use ->cq_wait for ring polling but add a separate wait queue for
it. We need it for following patches.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/dea0be0bf990503443c5c6c337fc66824af7d590.1673274244.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:40 -07:00
Pavel Begunkov
360173ab9e io_uring: move io_run_local_work_locked
io_run_local_work_locked() is only used in io_uring.c, move it there.
With that we can also make __io_run_local_work() static.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/91757bcb33e5774e49fed6f2b6e058630608119b.1673274244.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:40 -07:00
Pavel Begunkov
3e5655552a io_uring: mark io_run_local_work static
io_run_local_work is enclosed in io_uring.c, we don't need to export it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b477fb81f5e77044f724a06fe245d5c078659364.1673274244.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:40 -07:00
Pavel Begunkov
2f413956cc io_uring: don't set TASK_RUNNING in local tw runner
The CQ waiting loop sets TASK_RUNNING before trying to execute
task_work, no need to repeat it in io_run_local_work().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9d9422c429ef3f9457b4f4b8288bf4789564f33b.1673274244.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:40 -07:00
Pavel Begunkov
bd550173ac io_uring: refactor io_wake_function
Remove a local variable ctx in io_wake_function(), we don't need it if
io_should_wake() triggers it to wake up.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e60eb1008aebe286aab7d34c772ed01c447bddb1.1673274244.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:40 -07:00
Pavel Begunkov
d33a39e577 io_uring: keep timeout in io_wait_queue
Move waiting timeout into io_wait_queue

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e4b48a9e26a3b1cf97c80121e62d4b5ab873d28d.1672916894.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:40 -07:00
Pavel Begunkov
46ae7eef44 io_uring: optimise non-timeout waiting
Unlike the jiffy scheduling version, schedule_hrtimeout() jumps a few
functions before getting into schedule() even if there is no actual
timeout needed. Some tests showed that it takes up to 1% of CPU cycles.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/89f880574eceee6f4899783377ead234df7b3d04.1672916894.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:40 -07:00
Pavel Begunkov
326a9e482e io_uring: set TASK_RUNNING right after schedule
Instead of constantly watching that the state of the task is running
before executing tw or taking locks in io_cqring_wait(), switch it back
to TASK_RUNNING immediately.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/246dddee247d89fd52023f785ed17cc34962a008.1672916894.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:40 -07:00
Pavel Begunkov
490c00eb4f io_uring: simplify io_has_work
->work_llist should never be non-empty for a non DEFER_TASKRUN ring, so
we can safely skip checking the flag.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/26af9f73c09a56c9a035f94db56127358688f3aa.1672916894.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:40 -07:00
Pavel Begunkov
846072f16e io_uring: mimimise io_cqring_wait_schedule
io_cqring_wait_schedule() is called after we started waiting on the cq
wq and set the state to TASK_INTERRUPTIBLE, for that reason we have to
constantly worry whether we has returned the state back to running or
not. Leave only quick checks in io_cqring_wait_schedule() and move the
rest including running task work to the callers. Note, we run tw in the
loop after the sched checks because of the fast path in the beginning of
the function.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2814fabe75e2e019e7ca43ea07daa94564349805.1672916894.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:40 -07:00
Pavel Begunkov
3fcf19d592 io_uring: parse check_cq out of wq waiting
We already avoid flushing overflows in io_cqring_wait_schedule() but
only return an error for the outer loop to handle it. Minimise it even
further by moving all ->check_cq parsing there.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9dfcec3121013f98208dbf79368d636d74e1231a.1672916894.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:39 -07:00
Pavel Begunkov
140102ae9a io_uring: move defer tw task checks
Most places that want to run local tw explicitly and in advance check if
they are allowed to do so. Don't rely on a similar check in
__io_run_local_work(), leave it as a just-in-case warning and make sure
callers checks capabilities themselves.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/990fe0e8e70fd4d57e43625e5ce8fba584821d1a.1672916894.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:39 -07:00
Pavel Begunkov
1414d62985 io_uring: kill io_run_task_work_ctx
There is only one user of io_run_task_work_ctx(), inline it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/40953c65f7c88fb00cdc4d870ca5d5319fb3d7ea.1672916894.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:39 -07:00
Pavel Begunkov
f36ba6cf1a io_uring: don't iterate cq wait fast path
Task work runners keep running until all queues tw items are exhausted.
It's also rare for defer tw to queue normal tw and vise versa. Taking it
into account, there is only a dim chance that further iterating the
io_cqring_wait() fast path will get us anything and so we can remove
the loop there.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1f9565726661266abaa5d921e97433c831759ecf.1672916894.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:39 -07:00
Pavel Begunkov
0c4fe008c9 io_uring: rearrange defer list checks
There should be nothing in the ->work_llist for non DEFER_TASKRUN rings,
so we can skip flag checks and test the list emptiness directly. Also
move it out of io_run_local_work() for inlining.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/331d63fd15ca79b35b95c82a82d9246110686392.1672916894.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:39 -07:00
Dylan Yudaken
ef5c600adb io_uring: always prep_async for drain requests
Drain requests all go through io_drain_req, which has a quick exit in case
there is nothing pending (ie the drain is not useful). In that case it can
run the issue the request immediately.

However for safety it queues it through task work.
The problem is that in this case the request is run asynchronously, but
the async work has not been prepared through io_req_prep_async.

This has not been a problem up to now, as the task work always would run
before returning to userspace, and so the user would not have a chance to
race with it.

However - with IORING_SETUP_DEFER_TASKRUN - this is no longer the case and
the work might be defered, giving userspace a chance to change data being
referred to in the request.

Instead _always_ prep_async for drain requests, which is simpler anyway
and removes this issue.

Cc: stable@vger.kernel.org
Fixes: c0e0d6ba25 ("io_uring: add IORING_SETUP_DEFER_TASKRUN")
Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20230127105911.2420061-1-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-27 06:29:29 -07:00
Pavel Begunkov
8579538c89 io_uring/msg_ring: fix remote queue to disabled ring
IORING_SETUP_R_DISABLED rings don't have the submitter task set, so
it's not always safe to use ->submitter_task. Disallow posting msg_ring
messaged to disabled rings. Also add task NULL check for loosy sync
around testing for IORING_SETUP_R_DISABLED.

Cc: stable@vger.kernel.org
Fixes: 6d043ee116 ("io_uring: do msg_ring in target task via tw")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-20 09:49:34 -07:00
David Hildenbrand
fc4f4be9b5 mm/nommu: factor out check for NOMMU shared mappings into is_nommu_shared_mapping()
Patch series "mm/nommu: don't use VM_MAYSHARE for MAP_PRIVATE mappings".

Trying to reduce the confusion around VM_SHARED and VM_MAYSHARE first
requires !CONFIG_MMU to stop using VM_MAYSHARE for MAP_PRIVATE mappings. 
CONFIG_MMU only sets VM_MAYSHARE for MAP_SHARED mappings.

This paves the way for further VM_MAYSHARE and VM_SHARED cleanups: for
example, renaming VM_MAYSHARED to VM_MAP_SHARED to make it cleaner what is
actually means.

Let's first get the weird case out of the way and not use VM_MAYSHARE in
MAP_PRIVATE mappings, using a new VM_MAYOVERLAY flag instead.


This patch (of 3):

We want to stop using VM_MAYSHARE in private mappings to pave the way for
clarifying the semantics of VM_MAYSHARE vs.  VM_SHARED and reduce the
confusion.  While CONFIG_MMU uses VM_MAYSHARE to represent MAP_SHARED,
!CONFIG_MMU also sets VM_MAYSHARE for selected R/O private file mappings
that are an effective overlay of a file mapping.

Let's factor out all relevant VM_MAYSHARE checks in !CONFIG_MMU code into
is_nommu_shared_mapping() first.

Note that whenever VM_SHARED is set, VM_MAYSHARE must be set as well
(unless there is a serious BUG).  So there is not need to test for
VM_SHARED manually.

No functional change intended.

Link: https://lkml.kernel.org/r/20230102160856.500584-1-david@redhat.com
Link: https://lkml.kernel.org/r/20230102160856.500584-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Nicolas Pitre <nico@fluxnic.net>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-18 17:12:56 -08:00
Pavel Begunkov
12521a5d5c io_uring: fix CQ waiting timeout handling
Jiffy to ktime CQ waiting conversion broke how we treat timeouts, in
particular we rearm it anew every time we get into
io_cqring_wait_schedule() without adjusting the timeout. Waiting for 2
CQEs and getting a task_work in the middle may double the timeout value,
or even worse in some cases task may wait indefinitely.

Cc: stable@vger.kernel.org
Fixes: 228339662b ("io_uring: don't convert to jiffies for waiting on timeouts")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f7bffddd71b08f28a877d44d37ac953ddb01590d.1672915663.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-05 08:04:47 -07:00
Pavel Begunkov
f26cc95935 io_uring: lockdep annotate CQ locking
Locking around CQE posting is complex and depends on options the ring is
created with, add more thorough lockdep annotations checking all
invariants.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/aa3770b4eacae3915d782cc2ab2f395a99b4b232.1672795976.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-03 19:05:41 -07:00
Pavel Begunkov
9ffa13ff78 io_uring: pin context while queueing deferred tw
Unlike normal tw, nothing prevents deferred tw to be executed right
after an tw item added to ->work_llist in io_req_local_work_add(). For
instance, the waiting task may get waken up by CQ posting or a normal
tw. Thus we need to pin the ring for the rest of io_req_local_work_add()

Cc: stable@vger.kernel.org
Fixes: c0e0d6ba25 ("io_uring: add IORING_SETUP_DEFER_TASKRUN")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1a79362b9c10b8523ef70b061d96523650a23344.1672795998.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-03 19:03:28 -07:00
Jens Axboe
343190841a io_uring: check for valid register opcode earlier
We only check the register opcode value inside the restricted ring
section, move it into the main io_uring_register() function instead
and check it up front.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-23 06:40:32 -07:00
Jens Axboe
52ea806ad9 io_uring: finish waiting before flushing overflow entries
If we have overflow entries being generated after we've done the
initial flush in io_cqring_wait(), then we could be flushing them in the
main wait loop as well. If that's done after having added ourselves
to the cq_wait waitqueue, then the task state can be != TASK_RUNNING
when we enter the overflow flush.

Check for the need to overflow flush, and finish our wait cycle first
if we have to do so.

Reported-and-tested-by: syzbot+cf6ea1d6bb30a4ce10b2@syzkaller.appspotmail.com
Link: https://lore.kernel.org/io-uring/000000000000cb143a05f04eee15@google.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-21 08:43:53 -07:00
Jens Axboe
35d90f95cf io_uring: include task_work run after scheduling in wait for events
It's quite possible that we got woken up because task_work was queued,
and we need to process this task_work to generate the events waited for.
If we return to the wait loop without running task_work, we'll end up
adding the task to the waitqueue again, only to call
io_cqring_wait_schedule() again which will run the task_work. This is
less efficient than it could be, as it requires adding to the cq_wait
queue again. It also triggers the wakeup path for completions as
cq_wait is now non-empty with the task itself, and it'll require another
lock grab and deletion to remove ourselves from the waitqueue.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-17 20:35:54 -07:00
Dylan Yudaken
44a84da452 io_uring: use call_rcu_hurry if signaling an eventfd
io_uring uses call_rcu in the case it needs to signal an eventfd as a
result of an eventfd signal, since recursing eventfd signals are not
allowed. This should be calling the new call_rcu_hurry API to not delay
the signal.

Signed-off-by: Dylan Yudaken <dylany@meta.com>

Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Link: https://lore.kernel.org/r/20221215184138.795576-1-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-15 11:59:29 -07:00
Pavel Begunkov
a8cf95f936 io_uring: fix overflow handling regression
Because the single task locking series got reordered ahead of the
timeout and completion lock changes, two hunks inadvertently ended up
using __io_fill_cqe_req() rather than io_fill_cqe_req(). This meant
that we dropped overflow handling in those two spots. Reinstate the
correct CQE filling helper.

Fixes: f66f73421f ("io_uring: skip spinlocking for ->task_complete")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-15 08:20:10 -07:00
Pavel Begunkov
e5f30f6fb2 io_uring: ease timeout flush locking requirements
We don't need completion_lock for timeout flushing, don't take it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1e3dc657975ac445b80e7bdc40050db783a5935a.1670002973.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-14 08:53:35 -07:00
Pavel Begunkov
6971253f07 io_uring: revise completion_lock locking
io_kill_timeouts() doesn't post any events but queues everything to
task_work. Locking there is needed for protecting linked requests
traversing, we should grab completion_lock directly instead of using
io_cq_[un]lock helpers. Same goes for __io_req_find_next_prep().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/88e75d481a65dc295cb59722bb1cf76402d1c06b.1670002973.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-14 08:53:04 -07:00
Linus Torvalds
96f7e448b9 for-6.2/io_uring-next-2022-12-08
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmOSZJAQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpp+OD/9M04tGsVFCdqtKty5lBlDs03OnA03f404c
 Ottp2pop2JrBx7+ycS/dl78MQjmghh8Ectel8kOiswDeRQc98TtzWY31DF1d56yS
 aGYCq2Ww2gx5ziYmJgiFU7RRLTFlpfa/vZUBMK4HW4MYm2ihxtfNc72Oa8H9KGDJ
 /RYk5+PSCau+UFwyWu91rORVNRXjLr1mFmgzRTmFhL2unYYuOO83mK4GpK2f8rHx
 qpT7Wn9IS9xiTpr8rHqs8y6rxV6+Tnv/HqR8kKoviHvQU/u6fzvKSNEu1NvK/Znm
 V0z8cI4JJZUelDyqJe5ITq8cIS59amzILEIneclYQd20NkqcFYPlS56K6qR9qL7J
 6eNHvgH7iKvnk9JlR2soKojC6KWEPtVni2BjPEXgXHrfUWdINMKrT6MwTNOjztWj
 h+EaqLBGQb5m/nRCCMeE9kfUK23Rg6Ev+H+aas0SgqD5Isg/hVG+aMtjWLmWquCU
 pKg2UqxqsR9ymKj92KJSoN7F8Z2U0JsoHBKzAammlnmfhxl294RjGqBMJjjI5eS9
 Zu+fTte4EuFY5s/TvE5FCBmQ0Oeg+ud2f+GKXDzF25equtct8QCfHIbguM6Yr3X2
 3ANYGtLwP7Cj96U1Y++RWgTpBTTKwGkWyzEyS9SN/+MRXhIjeP8JZeGdXBDWaXLC
 Vp7j39Avgg==
 =MzhS
 -----END PGP SIGNATURE-----

Merge tag 'for-6.2/io_uring-next-2022-12-08' of git://git.kernel.dk/linux

Pull io_uring updates part two from Jens Axboe:

 - Misc fixes (me, Lin)

 - Series from Pavel extending the single task exclusive ring mode,
   yielding nice improvements for the common case of having a single
   ring per thread (Pavel)

 - Cleanup for MSG_RING, removing our IOPOLL hack (Pavel)

 - Further poll cleanups and fixes (Pavel)

 - Misc cleanups and fixes (Pavel)

* tag 'for-6.2/io_uring-next-2022-12-08' of git://git.kernel.dk/linux: (22 commits)
  io_uring/msg_ring: flag target ring as having task_work, if needed
  io_uring: skip spinlocking for ->task_complete
  io_uring: do msg_ring in target task via tw
  io_uring: extract a io_msg_install_complete helper
  io_uring: get rid of double locking
  io_uring: never run tw and fallback in parallel
  io_uring: use tw for putting rsrc
  io_uring: force multishot CQEs into task context
  io_uring: complete all requests in task context
  io_uring: don't check overflow flush failures
  io_uring: skip overflow CQE posting for dying ring
  io_uring: improve io_double_lock_ctx fail handling
  io_uring: dont remove file from msg_ring reqs
  io_uring: reshuffle issue_flags
  io_uring: don't reinstall quiesce node for each tw
  io_uring: improve rsrc quiesce refs checks
  io_uring: don't raw spin unlock to match cq_lock
  io_uring: combine poll tw handlers
  io_uring: improve poll warning handling
  io_uring: remove ctx variable in io_poll_check_events
  ...
2022-12-13 10:40:31 -08:00
Linus Torvalds
54e60e505d for-6.2/io_uring-2022-12-08
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmOSYp4QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpgWNEADTcHejQ5y8Ctc1BEsCiEnCbR5DeIwESBKN
 SfK7uK5N5GjKZgKeMtkdxNeckH2wSZ79VFpz2/b4fszM11H+P81r/PP+OwQojcra
 1YyPbp2jXlQZGEYijiXYDohnKr8NnJYj/nuwm4T73OPOD48ekbrY36/t3Hd75jKk
 M5L/HOpbKhA+fypPHYlm3XlwEM9/4wupsDeiabTuAeFpLh66V/h85ZLY91c3Bf7j
 yllzT2CCGQsh+fnqdovpsqUxM4Sh6sHfa2QhwkfJFfU9OLwDz0TLlcvSVwWWICD1
 DGeDE/MPBKZ4Z5zp8t94vTnZDAiExwkZbmCOnOkdYoPdMDqCDhp8a9WywV3IiJJi
 +hRN6Au6BWw8Rj3b3pbobfdXFJAvoHqXHM1SDAsWmEIMpuMhNalIpYqtna9JwRDe
 KV2ERMpGOUpmvQiRaJa5Wtz6hlFetmuav6Ij9DWb5f8+NXikiFAXk4g6TtD1Z7Cb
 15klwM8qW50zX7YUlzBLcjNdj7+HuIswLx0obZ3Uv1ogDwMQKXEvPaAHUwVN02q6
 cWP0P2Bc0EKvdwTSNpgNflhNsiV2DFJZpZfxkdPauN4t2EkJ38iEpHewffy0ecM7
 uyaXGQVQW7WFDuGno4cGbThQIG95MqkRPBhEyB4cOmcyS/aZ92ZFtV1iI8dVDA+v
 uuEIMc3OCA==
 =EgDc
 -----END PGP SIGNATURE-----

Merge tag 'for-6.2/io_uring-2022-12-08' of git://git.kernel.dk/linux

Pull io_uring updates from Jens Axboe:

 - Always ensure proper ordering in case of CQ ring overflow, which then
   means we can remove some work-arounds for that (Dylan)

 - Support completion batching for multishot, greatly increasing the
   efficiency for those (Dylan)

 - Flag epoll/eventfd wakeups done from io_uring, so that we can easily
   tell if we're recursing into io_uring again.

   Previously, this would have resulted in repeated multishot
   notifications if we had a dependency there. That could happen if an
   eventfd was registered as the ring eventfd, and we multishot polled
   for events on it. Or if an io_uring fd was added to epoll, and
   io_uring had a multishot request for the epoll fd.

   Test cases here:
	https://git.kernel.dk/cgit/liburing/commit/?id=919755a7d0096fda08fb6d65ac54ad8d0fe027cd

   Previously these got terminated when the CQ ring eventually
   overflowed, now it's handled gracefully (me).

 - Tightening of the IOPOLL based completions (Pavel)

 - Optimizations of the networking zero-copy paths (Pavel)

 - Various tweaks and fixes (Dylan, Pavel)

* tag 'for-6.2/io_uring-2022-12-08' of git://git.kernel.dk/linux: (41 commits)
  io_uring: keep unlock_post inlined in hot path
  io_uring: don't use complete_post in kbuf
  io_uring: spelling fix
  io_uring: remove io_req_complete_post_tw
  io_uring: allow multishot polled reqs to defer completion
  io_uring: remove overflow param from io_post_aux_cqe
  io_uring: add lockdep assertion in io_fill_cqe_aux
  io_uring: make io_fill_cqe_aux static
  io_uring: add io_aux_cqe which allows deferred completion
  io_uring: allow defer completion for aux posted cqes
  io_uring: defer all io_req_complete_failed
  io_uring: always lock in io_apoll_task_func
  io_uring: remove iopoll spinlock
  io_uring: iopoll protect complete_post
  io_uring: inline __io_req_complete_put()
  io_uring: remove io_req_tw_post_queue
  io_uring: use io_req_task_complete() in timeout
  io_uring: hold locks for io_req_complete_failed
  io_uring: add completion locking for iopoll
  io_uring: kill io_cqring_ev_posted() and __io_cq_unlock_post()
  ...
2022-12-13 10:33:08 -08:00
Pavel Begunkov
f66f73421f io_uring: skip spinlocking for ->task_complete
->task_complete was added to serialised CQE posting by doing it from
the task context only (or fallback wq when the task is dead), and now we
can use that to avoid taking ->completion_lock while filling CQ entries.
The patch skips spinlocking only in two spots,
__io_submit_flush_completions() and flushing in io_aux_cqe, it's safer
and covers all cases we care about. Extra care is taken to force taking
the lock while queueing overflow entries.

It fundamentally relies on SINGLE_ISSUER to have only one task posting
events. It also need to take into account overflowed CQEs, flushing of
which happens in the cq wait path, and so this implementation also needs
DEFER_TASKRUN to limit waiters. For the same reason we disable it for
SQPOLL, and for IOPOLL as it won't benefit from it in any case.
DEFER_TASKRUN, SQPOLL and IOPOLL requirement may be relaxed in the
future.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2a8c91fd82cfcdcc1d2e5bac7051fe2c183bda73.1670384893.git.asml.silence@gmail.com
[axboe: modify to apply]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-07 08:51:08 -07:00
Pavel Begunkov
77e443ab29 io_uring: never run tw and fallback in parallel
Once we fallback a tw we want all requests to that task to be given to
the fallback wq so we dont run it in parallel with the last, i.e. post
PF_EXITING, tw run of the task.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/96f4987265c4312f376f206511c6af3e77aaf5ac.1670384893.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-07 06:47:13 -07:00
Pavel Begunkov
d34b1b0b67 io_uring: use tw for putting rsrc
Use task_work for completing rsrc removals, it'll be needed later for
spinlock optimisations.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/cbba5d53a11ee6fc2194dacea262c1d733c8b529.1670384893.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-07 06:47:13 -07:00
Pavel Begunkov
e6aeb2721d io_uring: complete all requests in task context
This patch adds ctx->task_complete flag. If set, we'll complete all
requests in the context of the original task. Note, this extends to
completion CQE posting only but not io_kiocb cleanup / free, e.g. io-wq
may free the requests in the free calllback. This flag will be used
later for optimisations purposes.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/21ece72953f76bb2e77659a72a14326227ab6460.1670384893.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-07 06:47:13 -07:00
Pavel Begunkov
1b346e4aa8 io_uring: don't check overflow flush failures
The only way to fail overflowed CQEs flush is for CQ to be fully packed.
There is one place checking for flush failures, i.e. io_cqring_wait(),
but we limit the number to be waited for by the CQ size, so getting a
failure automatically means that we're done with waiting.

Don't check for failures, rarely but they might spuriously fail CQ
waiting with -EBUSY.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6b720a45c03345655517f8202cbd0bece2848fb2.1670384893.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-07 06:47:13 -07:00
Pavel Begunkov
a85381d832 io_uring: skip overflow CQE posting for dying ring
After io_ring_ctx_wait_and_kill() is called there should be no users
poking into rings and so there is no need to post CQEs. So, instead of
trying to post overflowed CQEs into the CQ, drop them. Also, do it
in io_ring_exit_work() in a loop to reduce the number of contexts it
can be executed from and even when it struggles to quiesce the ring we
won't be leaving memory allocated for longer than needed.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/26d13751155a735a3029e24f8d9ca992f810419d.1670384893.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-07 06:47:13 -07:00
Pavel Begunkov
ef0ec1ad03 io_uring: dont remove file from msg_ring reqs
We should not be messing with req->file outside of core paths. Clearing
it makes msg_ring non reentrant, i.e. luckily io_msg_send_fd() fails the
request on failed io_double_lock_ctx() but clearly was originally
intended to do retries instead.

Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e5ac9edadb574fe33f6d727cb8f14ce68262a684.1670384893.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-07 06:47:13 -07:00
Harshit Mogalapalli
998b30c394 io_uring: Fix a null-ptr-deref in io_tctx_exit_cb()
Syzkaller reports a NULL deref bug as follows:

 BUG: KASAN: null-ptr-deref in io_tctx_exit_cb+0x53/0xd3
 Read of size 4 at addr 0000000000000138 by task file1/1955

 CPU: 1 PID: 1955 Comm: file1 Not tainted 6.1.0-rc7-00103-gef4d3ea40565 #75
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
 Call Trace:
  <TASK>
  dump_stack_lvl+0xcd/0x134
  ? io_tctx_exit_cb+0x53/0xd3
  kasan_report+0xbb/0x1f0
  ? io_tctx_exit_cb+0x53/0xd3
  kasan_check_range+0x140/0x190
  io_tctx_exit_cb+0x53/0xd3
  task_work_run+0x164/0x250
  ? task_work_cancel+0x30/0x30
  get_signal+0x1c3/0x2440
  ? lock_downgrade+0x6e0/0x6e0
  ? lock_downgrade+0x6e0/0x6e0
  ? exit_signals+0x8b0/0x8b0
  ? do_raw_read_unlock+0x3b/0x70
  ? do_raw_spin_unlock+0x50/0x230
  arch_do_signal_or_restart+0x82/0x2470
  ? kmem_cache_free+0x260/0x4b0
  ? putname+0xfe/0x140
  ? get_sigframe_size+0x10/0x10
  ? do_execveat_common.isra.0+0x226/0x710
  ? lockdep_hardirqs_on+0x79/0x100
  ? putname+0xfe/0x140
  ? do_execveat_common.isra.0+0x238/0x710
  exit_to_user_mode_prepare+0x15f/0x250
  syscall_exit_to_user_mode+0x19/0x50
  do_syscall_64+0x42/0xb0
  entry_SYSCALL_64_after_hwframe+0x63/0xcd
 RIP: 0023:0x0
 Code: Unable to access opcode bytes at 0xffffffffffffffd6.
 RSP: 002b:00000000fffb7790 EFLAGS: 00000200 ORIG_RAX: 000000000000000b
 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
 RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
  </TASK>
 Kernel panic - not syncing: panic_on_warn set ...

This happens because the adding of task_work from io_ring_exit_work()
isn't synchronized with canceling all work items from eg exec. The
execution of the two are ordered in that they are both run by the task
itself, but if io_tctx_exit_cb() is queued while we're canceling all
work items off exec AND gets executed when the task exits to userspace
rather than in the main loop in io_uring_cancel_generic(), then we can
find current->io_uring == NULL and hit the above crash.

It's safe to add this NULL check here, because the execution of the two
paths are done by the task itself.

Cc: stable@vger.kernel.org
Fixes: d56d938b4b ("io_uring: do ctx initiated file note removal")
Reported-by: syzkaller <syzkaller@googlegroups.com>
Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
Link: https://lore.kernel.org/r/20221206093833.3812138-1-harshit.m.mogalapalli@oracle.com
[axboe: add code comment and also put an explanation in the commit msg]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-07 06:45:20 -07:00
Pavel Begunkov
618d653a34 io_uring: don't raw spin unlock to match cq_lock
There is one newly added place when we lock ring with io_cq_lock() but
unlocking is hand coded calling spin_unlock directly. It's ugly and
troublesome in the long run. Make it consistent with the other completion
locking.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4ca4f0564492b90214a190cd5b2a6c76522de138.1669821213.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-30 10:28:49 -07:00
Pavel Begunkov
f6f7f903e7 io_uring: kill io_poll_issue's PF_EXITING check
We don't need to worry about checking PF_EXITING in io_poll_issue().
task works using the function should take care of it and never try to
resubmit / retry if the task is dying.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2e9dc998dc07507c759a0c9cb5d2fbea0710d58c.1669821213.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-30 10:26:57 -07:00
Pavel Begunkov
5d77291685 io_uring: keep unlock_post inlined in hot path
This partially reverts

6c16fe3c16 ("io_uring: kill io_cqring_ev_posted() and __io_cq_unlock_post()")

The redundancy of __io_cq_unlock_post() was always to keep it inlined
into __io_submit_flush_completions(). Inline it back and rename with
hope of clarifying the intention behind it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/372a16c485fca44c069be2e92fc5e7332a1d7fd7.1669310258.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-25 06:11:15 -07:00
Dylan Yudaken
10d8bc3541 io_uring: spelling fix
s/pushs/pushes/

Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20221125103412.1425305-3-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-25 06:10:46 -07:00
Dylan Yudaken
27f35fe909 io_uring: remove io_req_complete_post_tw
It's only used in one place. Inline it.

Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20221125103412.1425305-2-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-25 06:10:46 -07:00
Dylan Yudaken
9a6924519e io_uring: allow multishot polled reqs to defer completion
Until now there was no reason for multishot polled requests to defer
completions as there was no functional difference. However now this will
actually defer the completions, for a performance win.

Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20221124093559.3780686-10-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-25 06:10:04 -07:00
Dylan Yudaken
b529c96a89 io_uring: remove overflow param from io_post_aux_cqe
The only call sites which would not allow overflow are also call sites
which would use the io_aux_cqe as they care about ordering.

So remove this parameter from io_post_aux_cqe.

Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20221124093559.3780686-9-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-25 06:10:04 -07:00
Dylan Yudaken
2e2ef4a1da io_uring: add lockdep assertion in io_fill_cqe_aux
Add an assertion for the completion lock to io_fill_cqe_aux

Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20221124093559.3780686-8-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-25 06:10:04 -07:00
Dylan Yudaken
a77ab745f2 io_uring: make io_fill_cqe_aux static
This is only used in io_uring.c

Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20221124093559.3780686-7-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-25 06:10:04 -07:00
Dylan Yudaken
9b8c54755a io_uring: add io_aux_cqe which allows deferred completion
Use the just introduced deferred post cqe completion state when possible
in io_aux_cqe. If not possible fallback to io_post_aux_cqe.

This introduces a complication because of allow_overflow. For deferred
completions we cannot know without locking the completion_lock if it will
overflow (and even if we locked it, another post could sneak in and cause
this cqe to be in overflow).
However since overflow protection is mostly a best effort defence in depth
to prevent infinite loops of CQEs for poll, just checking the overflow bit
is going to be good enough and will result in at most 16 (array size of
deferred cqes) overflows.

Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20221124093559.3780686-6-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-25 06:10:04 -07:00
Dylan Yudaken
931147ddfa io_uring: allow defer completion for aux posted cqes
Multishot ops cannot use the compl_reqs list as the request must stay in
the poll list, but that means they need to run each completion without
benefiting from batching.

Here introduce batching infrastructure for only small (ie 16 byte)
CQEs. This restriction is ok because there are no use cases posting 32
byte CQEs.

In the ring keep a batch of up to 16 posted results, and flush in the same
way as compl_reqs.

16 was chosen through experimentation on a microbenchmark ([1]), as well
as trying not to increase the size of the ring too much. This increases
the size to 1472 bytes from 1216.

[1]: 9ac66b36bc
Run with $ make -j && ./benchmark/reg.b -s 1 -t 2000 -r 10
Gives results:
baseline	8309 k/s
8		18807 k/s
16		19338 k/s
32		20134 k/s

Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20221124093559.3780686-5-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-25 06:10:04 -07:00
Dylan Yudaken
973fc83f3a io_uring: defer all io_req_complete_failed
All failures happen under lock now, and can be deferred. To be consistent
when the failure has happened after some multishot cqe has been
deferred (and keep ordering), always defer failures.

To make this obvious at the caller (and to help prevent a future bug)
rename io_req_complete_failed to io_req_defer_failed.

Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20221124093559.3780686-4-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-25 06:10:04 -07:00
Pavel Begunkov
1bec951c38 io_uring: iopoll protect complete_post
io_req_complete_post() may be used by iopoll enabled rings, grab locks
in this case. That requires to pass issue_flags to propagate the locking
state.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/cc6d854065c57c838ca8e8806f707a226b70fd2d.1669203009.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-23 10:45:31 -07:00
Pavel Begunkov
fa18fa2272 io_uring: inline __io_req_complete_put()
Inline __io_req_complete_put() into io_req_complete_post(), there are no
other users.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1923a4dfe80fa877f859a22ed3df2d5fc8ecf02b.1669203009.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-23 10:44:01 -07:00
Pavel Begunkov
833b5dfffc io_uring: remove io_req_tw_post_queue
Remove io_req_tw_post() and io_req_tw_post_queue(), we can use
io_req_task_complete() instead.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b9b73c08022c7f1457023ac841f35c0100e70345.1669203009.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-23 10:44:00 -07:00
Pavel Begunkov
e276ae344a io_uring: hold locks for io_req_complete_failed
A preparation patch, make sure we always hold uring_lock around
io_req_complete_failed(). The only place deviating from the rule
is io_cancel_defer_files(), queue a tw instead.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/70760344eadaecf2939287084b9d4ba5c05a6984.1669203009.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-23 10:44:00 -07:00
Jens Axboe
6c16fe3c16 io_uring: kill io_cqring_ev_posted() and __io_cq_unlock_post()
__io_cq_unlock_post() is identical to io_cq_unlock_post(), and
io_cqring_ev_posted() has a single caller so migth as well just inline
it there.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-22 06:09:30 -07:00
Jens Axboe
4464853277 io_uring: pass in EPOLL_URING_WAKE for eventfd signaling and wakeups
Pass in EPOLL_URING_WAKE when signaling eventfd or doing poll related
wakups, so that we can check for a circular event dependency between
eventfd and epoll. If this flag is set when our wakeup handlers are
called, then we know we have a dependency that needs to terminate
multishot requests.

eventfd and epoll are the only such possible dependencies.

Cc: stable@vger.kernel.org # 6.0
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-22 06:08:31 -07:00
Pavel Begunkov
f9d567c75e io_uring: inline __io_req_complete_post()
There is only one user of __io_req_complete_post(), inline it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ef4c9059950a3da5cf68df00f977f1fd13bd9306.1668597569.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-21 07:45:19 -07:00
Pavel Begunkov
d759360620 io_uring: split tw fallback into a function
When the target process is dying and so task_work_add() is not allowed
we push all task_work item to the fallback workqueue. Move the part
responsible for moving tw items out of __io_req_task_work_add() into
a separate function. Makes it a bit cleaner and gives the compiler a bit
of extra info.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e503dab9d7af95470ca6b214c6de17715ae4e748.1668162751.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-21 07:44:21 -07:00
Pavel Begunkov
e52d2e583e io_uring: inline io_req_task_work_add()
__io_req_task_work_add() is huge but marked inline, that makes compilers
to generate lots of garbage. Inline the wrapper caller
io_req_task_work_add() instead.

before and after:
   text    data     bss     dec     hex filename
  47347   16248       8   63603    f873 io_uring/io_uring.o
   text    data     bss     dec     hex filename
  45303   16248       8   61559    f077 io_uring/io_uring.o

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/26dc8c28ca0160e3269ef3e55c5a8b917c4d4450.1668162751.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-21 07:44:18 -07:00
Lin Ma
23a6c9ac4d io_uring: update outdated comment of callbacks
Previous commit ebc11b6c6b ("io_uring: clean io-wq callbacks") rename
io_free_work() into io_wq_free_work() for consistency. This patch also
updates relevant comment to avoid misunderstanding.

Fixes: ebc11b6c6b ("io_uring: clean io-wq callbacks")
Signed-off-by: Lin Ma <linma@zju.edu.cn>
Link: https://lore.kernel.org/r/20221110122103.20120-1-linma@zju.edu.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-21 07:44:16 -07:00
Dylan Yudaken
ef67fcb41d io_uring: do not always force run task_work in io_uring_register
Running task work when not needed can unnecessarily delay
operations. Specifically IORING_SETUP_DEFER_TASKRUN tries to avoid running
task work until the user requests it. Therefore do not run it in
io_uring_register any more.

The one catch is that io_rsrc_ref_quiesce expects it to have run in order
to process all outstanding references, and so reorder it's loop to do this.

Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20221107123349.4106213-1-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-21 07:38:54 -07:00
Pavel Begunkov
3671163beb io_uring: move kbuf put out of generic tw complete
There are multiple users of io_req_task_complete() including zc
notifications, but only read requests use selected buffers. As we
already have an rw specific tw function, move io_put_kbuf() in there.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/94374c7649aaefc3a17808dc4701f25ccd457e25.1667557923.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-21 07:38:31 -07:00
Pavel Begunkov
9148286476 io_uring: fix multishot accept request leaks
Having REQ_F_POLLED set doesn't guarantee that the request is
executed as a multishot from the polling path. Fortunately for us, if
the code thinks it's multishot issue when it's not, it can only ask to
skip completion so leaking the request. Use issue_flags to mark
multipoll issues.

Cc: stable@vger.kernel.org
Fixes: 390ed29b5e ("io_uring: add IORING_ACCEPT_MULTISHOT for accept")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7700ac57653f2823e30b34dc74da68678c0c5f13.1668710222.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-17 12:33:33 -07:00
Dylan Yudaken
0fc8c2acbf io_uring: calculate CQEs from the user visible value
io_cqring_wait (and it's wake function io_has_work) used cached_cq_tail in
order to calculate the number of CQEs. cached_cq_tail is set strictly
before the user visible rings->cq.tail

However as far as userspace is concerned,  if io_uring_enter(2) is called
with a minimum number of events, they will verify by checking
rings->cq.tail.

It is therefore possible for io_uring_enter(2) to return early with fewer
events visible to the user.

Instead make the wait functions read from the user visible value, so there
will be no discrepency.

This is triggered eventually by the following reproducer:

struct io_uring_sqe *sqe;
struct io_uring_cqe *cqe;
unsigned int cqe_ready;
struct io_uring ring;
int ret, i;

ret = io_uring_queue_init(N, &ring, 0);
assert(!ret);
while(true) {
	for (i = 0; i < N; i++) {
		sqe = io_uring_get_sqe(&ring);
		io_uring_prep_nop(sqe);
		sqe->flags |= IOSQE_ASYNC;
	}
	ret = io_uring_submit(&ring);
	assert(ret == N);

	do {
		ret = io_uring_wait_cqes(&ring, &cqe, N, NULL, NULL);
	} while(ret == -EINTR);
	cqe_ready = io_uring_cq_ready(&ring);
	assert(!ret);
	assert(cqe_ready == N);
	io_uring_cq_advance(&ring, N);
}

Fixes: ad3eb2c89f ("io_uring: split overflow state into SQ and CQ side")
Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20221108153016.1854297-1-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-08 10:36:15 -07:00
Dylan Yudaken
b3026767e1 io_uring: unlock if __io_run_local_work locked inside
It is possible for tw to lock the ring, and this was not propogated out to
io_run_local_work. This can cause an unlock to be missed.

Instead pass a pointer to locked into __io_run_local_work.

Fixes: 8ac5d85a89 ("io_uring: add local task_work run helper that is entered locked")
Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20221027144429.3971400-3-dylany@meta.com
[axboe: WARN_ON() -> WARN_ON_ONCE() and add a minor comment]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-27 09:52:12 -06:00
Dylan Yudaken
8de11cdc96 io_uring: use io_run_local_work_locked helper
prefer to use io_run_local_work_locked helper for consistency

Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20221027144429.3971400-2-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-27 09:51:47 -06:00
Pavel Begunkov
02bac94bd8 io_uring: don't iopoll from io_ring_ctx_wait_and_kill()
We should not be completing requests from a task context that has already
undergone io_uring cancellations, i.e. __io_uring_cancel(), as there are
some assumptions, e.g. around cached task refs draining. Remove
iopolling from io_ring_ctx_wait_and_kill() as it can be called later
after PF_EXITING is set with the last task_work run.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7c03cc91455c4a1af49c6b9cbda4e57ea467aa11.1665891182.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-16 17:08:42 -06:00
Pavel Begunkov
34f0bc427e io_uring: reuse io_alloc_req()
Don't duplicate io_alloc_req() in io_req_caches_free() but reuse the
helper.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6005fc88274864a49fc3096c22d8bdd605cf8576.1665891182.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-16 17:08:42 -06:00
Pavel Begunkov
4d5059512d io_uring: kill hot path fixed file bitmap debug checks
We test file_table.bitmap in io_file_get_fixed() to check invariants,
don't do it, it's expensive and was showing up in profiles. No reports of
this triggering has come in. Move the check to the file clear instead,
which will still catch any wrong usage.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/cf77f2ded68d2e5b2bc7355784d969837d48e023.1665891182.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-16 17:07:53 -06:00
Pavel Begunkov
38eddb2c75 io_uring: remove FFS_SCM
THe lifetime of SCM'ed files is bound to ring_sock, which is destroyed
strictly after we're done with registered file tables. This means there
is no need for the FFS_SCM hack, which was not available on 32-bit builds
anyway.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/984226a1045adf42dc35d8bd7fb5a8bbfa472ce1.1665891182.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-16 17:07:12 -06:00
Pavel Begunkov
44f87745d5 io_uring: optimise locking for local tw with submit_wait
Running local task_work requires taking uring_lock, for submit + wait we
can try to run them right after submit while we still hold the lock and
save one lock/unlokc pair. The optimisation was implemented in the first
local tw patches but got dropped for simplicity.

Suggested-by: Dylan Yudaken <dylany@fb.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/281fc79d98b5d91fe4778c5137a17a2ab4693e5c.1665088876.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-12 16:30:56 -06:00
Pavel Begunkov
fc86f9d3bb io_uring: remove redundant memory barrier in io_req_local_work_add
io_cqring_wake() needs a barrier for the waitqueue_active() check.
However, in the case of io_req_local_work_add(), we call llist_add()
first, which implies an atomic. Hence we can replace smb_mb() with
smp_mb__after_atomic().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/43983bc8bc507172adda7a0f00cab1aff09fd238.1665018309.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-12 16:30:56 -06:00
Pavel Begunkov
b7a817752e io_uring: remove notif leftovers
Notifications were killed but there is a couple of fields and struct
declarations left, remove them.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8df8877d677be5a2b43afd936d600e60105ea960.1664849941.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-12 16:30:56 -06:00
Pavel Begunkov
42b6419d0a io_uring: correct pinned_vm accounting
->mm_account should be released only after we free all registered
buffers, otherwise __io_sqe_buffers_unregister() will see a NULL
->mm_account and skip locked_vm accounting.

Cc: <Stable@vger.kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6d798f65ed4ab8db3664c4d3397d4af16ca98846.1664849932.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-12 16:30:56 -06:00
Pavel Begunkov
d7cce96c44 io_uring: limit registration w/ SINGLE_ISSUER
IORING_SETUP_SINGLE_ISSUER restricts what tasks can submit requests.
Extend it to registration as well, so non-owning task can't do
registrations. It's not necessary at the moment but might be useful in
the future.

Cc: <stable@vger.kernel.org> # 6.0
Fixes: 97bbdc06a4 ("io_uring: add IORING_SETUP_SINGLE_ISSUER")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f52a6a9c8a8990d4a831f73c0571e7406aac2bba.1664237592.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-07 12:25:30 -06:00
Dylan Yudaken
97c96e9fa3 io_uring: simplify __io_uring_add_tctx_node
Remove submitter parameter from __io_uring_add_tctx_node.

It was only called from one place, and we can do that logic in that one
place.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Fixes: 97bbdc06a4 ("io_uring: add IORING_SETUP_SINGLE_ISSUER")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-07 12:25:30 -06:00
Linus Torvalds
513389809e for-6.1/block-2022-10-03
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmM67XkQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpiHoD/9eN+6YnNRPu5+2zeGnnm1Nlwic6YMZeORr
 KFIeC0COMWoFhNBIPFkgAKT+0qIH+uGt5UsHSM3Y5La7wMR8yLxD4PAnvTZ/Ijtt
 yxVIOmonJoQ0OrQ2kTbvDXL/9OCUrzwXXyUIEPJnH0Ca1mxeNOgDHbE7VGF6DMul
 0D3pI8qs2WLnHlDi1V/8kH5qZ6WoAJSDcb8sTzOUVnyveZPNaZhGQJuHA2XAYMtg
 fqKMDJqgmNk6jdTMUgdF5B+rV64PQoCy28I7fXqGkEe+RE5TBy57vAa0XY84V8XR
 /a8CEuwMts2ypk1hIcJG8Vv8K6u5war9yPM5MTngKsoMpzNIlhrhaJQVyjKdcs+E
 Ixwzexu6xTYcrcq+mUARgeTh79FzTBM/uXEdbCG2G3S6HPd6UZWUJZGfxw/l0Aem
 V4xB7lj6SQaJDU1iJCYUaHcekNXhQAPvyVG+R2ED1SO3McTpTPIM1aeigxw6vj7u
 bH3Kfdr94Z8HNuoLuiS6YYfjNt2Shf4LEB6GxKJ9TYHtyhdOyO0H64jGHpygrWqN
 cSnkWPUqUUNpF7srKM0ZgbliCshvmyJc4aMOFd0gBY/kXf5J/j7IXvh8TFCi9rHH
 0KyZH3/3Zsu9geUn3ynznlr4FXU+BcqE6boaa/iWb9sN1m+Rvaahv8cSch/dh44a
 vQNj/iOBQA==
 =R05e
 -----END PGP SIGNATURE-----

Merge tag 'for-6.1/block-2022-10-03' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

 - NVMe pull requests via Christoph:
      - handle number of queue changes in the TCP and RDMA drivers
        (Daniel Wagner)
      - allow changing the number of queues in nvmet (Daniel Wagner)
      - also consider host_iface when checking ip options (Daniel
        Wagner)
      - don't map pages which can't come from HIGHMEM (Fabio M. De
        Francesco)
      - avoid unnecessary flush bios in nvmet (Guixin Liu)
      - shrink and better pack the nvme_iod structure (Keith Busch)
      - add comment for unaligned "fake" nqn (Linjun Bao)
      - print actual source IP address through sysfs "address" attr
        (Martin Belanger)
      - various cleanups (Jackie Liu, Wolfram Sang, Genjian Zhang)
      - handle effects after freeing the request (Keith Busch)
      - copy firmware_rev on each init (Keith Busch)
      - restrict management ioctls to admin (Keith Busch)
      - ensure subsystem reset is single threaded (Keith Busch)
      - report the actual number of tagset maps in nvme-pci (Keith
        Busch)
      - small fabrics authentication fixups (Christoph Hellwig)
      - add common code for tagset allocation and freeing (Christoph
        Hellwig)
      - stop using the request_queue in nvmet (Christoph Hellwig)
      - set min_align_mask before calculating max_hw_sectors (Rishabh
        Bhatnagar)
      - send a rediscover uevent when a persistent discovery controller
        reconnects (Sagi Grimberg)
      - misc nvmet-tcp fixes (Varun Prakash, zhenwei pi)

 - MD pull request via Song:
      - Various raid5 fix and clean up, by Logan Gunthorpe and David
        Sloan.
      - Raid10 performance optimization, by Yu Kuai.

 - sbitmap wakeup hang fixes (Hugh, Keith, Jan, Yu)

 - IO scheduler switching quisce fix (Keith)

 - s390/dasd block driver updates (Stefan)

 - support for recovery for the ublk driver (ZiyangZhang)

 - rnbd drivers fixes and updates (Guoqing, Santosh, ye, Christoph)

 - blk-mq and null_blk map fixes (Bart)

 - various bcache fixes (Coly, Jilin, Jules)

 - nbd signal hang fix (Shigeru)

 - block writeback throttling fix (Yu)

 - optimize the passthrough mapping handling (me)

 - prepare block cgroups to being gendisk based (Christoph)

 - get rid of an old PSI hack in the block layer, moving it to the
   callers instead where it belongs (Christoph)

 - blk-throttle fixes and cleanups (Yu)

 - misc fixes and cleanups (Liu Shixin, Liu Song, Miaohe, Pankaj,
   Ping-Xiang, Wolfram, Saurabh, Li Jinlin, Li Lei, Lin, Li zeming,
   Miaohe, Bart, Coly, Gaosheng

* tag 'for-6.1/block-2022-10-03' of git://git.kernel.dk/linux: (162 commits)
  sbitmap: fix lockup while swapping
  block: add rationale for not using blk_mq_plug() when applicable
  block: adapt blk_mq_plug() to not plug for writes that require a zone lock
  s390/dasd: use blk_mq_alloc_disk
  blk-cgroup: don't update the blkg lookup hint in blkg_conf_prep
  nvmet: don't look at the request_queue in nvmet_bdev_set_limits
  nvmet: don't look at the request_queue in nvmet_bdev_zone_mgmt_emulate_all
  blk-mq: use quiesced elevator switch when reinitializing queues
  block: replace blk_queue_nowait with bdev_nowait
  nvme: remove nvme_ctrl_init_connect_q
  nvme-loop: use the tagset alloc/free helpers
  nvme-loop: store the generic nvme_ctrl in set->driver_data
  nvme-loop: initialize sqsize later
  nvme-fc: use the tagset alloc/free helpers
  nvme-fc: store the generic nvme_ctrl in set->driver_data
  nvme-fc: keep ctrl->sqsize in sync with opts->queue_size
  nvme-rdma: use the tagset alloc/free helpers
  nvme-rdma: store the generic nvme_ctrl in set->driver_data
  nvme-tcp: use the tagset alloc/free helpers
  nvme-tcp: store the generic nvme_ctrl in set->driver_data
  ...
2022-10-07 09:19:14 -07:00
Linus Torvalds
0a78a376ef for-6.1/io_uring-2022-10-03
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmM67S0QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgppnPEACkBzilBLKwT9MWdUAITwyrMXsAa1R9gsR9
 Tb3Xs+mNO2meuycLAUh4LIbb28NNr7/S5rwWet5NRZ71hgv4Q/WA/0EemAGGXYqd
 +3MEBAWU3FBFkC/cJXCnT8F5yCXYRkT5n/hzCSNEpNKjQ5JnAhHDlWAjgzZRuD/A
 A+YJjoBVJJuI1wY4I5XCpeQXEmg/Wc1MgXfyHgWVtGKnYrrxibiCnBZnqbAMZNvD
 hGn1Vl02ooamGTFm/nW/OAt71DtqsjWUCVOHKmlZ+zBUjbUj6FMXmPVV7vCV9o2w
 PT4Dx3CTc2iXwa8KfEFNPvXBzy0Qfu8edweP/MvZHWHVZREpEAh4cG6GhwW8whD+
 5mPisqmRjZKe0BBS4k/wKN1RXEypSQoTU4EdljfbQPU/usn35lmjMmEXXgs3IhqM
 fcTdO5ZUOp+CGyzI0Bc7UtS8vilJbX9ynN8G80MUUAZzuQg39MH7lNQYSJSSvJfU
 OlvzmL3lhRLYM1s/KKiZzdDBoMvC7R4oHmzCveOjQTMIHf6WNyqKFlrWScq2wzpN
 oRxqt0xiVQ3PFMmFj6N08f145qtbASuF3sKv7dbU3QXTsCAos3wdTdX+PejYApEZ
 W3dr0TDjNBicLNVPiSj132p0ZRtdZvLGuGVkBD4GPQeH2NwswxMHQAfz8e2lqmA4
 9bWG6BM7Yw==
 =m9kX
 -----END PGP SIGNATURE-----

Merge tag 'for-6.1/io_uring-2022-10-03' of git://git.kernel.dk/linux

Pull io_uring updates from Jens Axboe:

 - Add supported for more directly managed task_work running.

   This is beneficial for real world applications that end up issuing
   lots of system calls as part of handling work. Normal task_work will
   always execute as we transition in and out of the kernel, even for
   "unrelated" system calls. It's more efficient to defer the handling
   of io_uring's deferred work until the application wants it to be run,
   generally in batches.

   As part of ongoing work to write an io_uring network backend for
   Thrift, this has been shown to greatly improve performance. (Dylan)

 - Add IOPOLL support for passthrough (Kanchan)

 - Improvements and fixes to the send zero-copy support (Pavel)

 - Partial IO handling fixes (Pavel)

 - CQE ordering fixes around CQ ring overflow (Pavel)

 - Support sendto() for non-zc as well (Pavel)

 - Support sendmsg for zerocopy (Pavel)

 - Networking iov_iter fix (Stefan)

 - Misc fixes and cleanups (Pavel, me)

* tag 'for-6.1/io_uring-2022-10-03' of git://git.kernel.dk/linux: (56 commits)
  io_uring/net: fix notif cqe reordering
  io_uring/net: don't update msg_name if not provided
  io_uring: don't gate task_work run on TIF_NOTIFY_SIGNAL
  io_uring/rw: defer fsnotify calls to task context
  io_uring/net: fix fast_iov assignment in io_setup_async_msg()
  io_uring/net: fix non-zc send with address
  io_uring/net: don't skip notifs for failed requests
  io_uring/rw: don't lose short results on io_setup_async_rw()
  io_uring/rw: fix unexpected link breakage
  io_uring/net: fix cleanup double free free_iov init
  io_uring: fix CQE reordering
  io_uring/net: fix UAF in io_sendrecv_fail()
  selftest/net: adjust io_uring sendzc notif handling
  io_uring: ensure local task_work marks task as running
  io_uring/net: zerocopy sendmsg
  io_uring/net: combine fail handlers
  io_uring/net: rename io_sendzc()
  io_uring/net: support non-zerocopy sendto
  io_uring/net: refactor io_setup_async_addr
  io_uring/net: don't lose partial send_zc on fail
  ...
2022-10-07 08:52:43 -07:00
Christoph Hellwig
568ec936bf block: replace blk_queue_nowait with bdev_nowait
Replace blk_queue_nowait with a bdev_nowait helpers that takes the
block_device given that the I/O submission path should not have to
look into the request_queue.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
Link: https://lore.kernel.org/r/20220927075815.269694-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-27 09:57:58 -06:00
Dylan Yudaken
7cae596bc3 io_uring: register single issuer task at creation
Instead of picking the task from the first submitter task, rather use the
creator task or in the case of disabled (IORING_SETUP_R_DISABLED) the
enabling task.

This approach allows a lot of simplification of the logic here. This
removes init logic from the submission path, which can always be a bit
confusing, but also removes the need for locking to write (or read) the
submitter_task.

Users that want to move a ring before submitting can create the ring
disabled and then enable it on the submitting task.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Fixes: 97bbdc06a4 ("io_uring: add IORING_SETUP_SINGLE_ISSUER")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-26 11:26:18 -06:00
Jens Axboe
e775f93f2a io_uring: ensure that cached task references are always put on exit
io_uring caches task references to avoid doing atomics for each of them
per request. If a request is put from the same task that allocated it,
then we can maintain a per-ctx cache of them. This obviously relies
on io_uring always pruning caches in a reliable way, and there's
currently a case off io_uring fd release where we can miss that.

One example is a ring setup with IOPOLL, which relies on the task
polling for completions, which will free them. However, if such a task
submits a request and then exits or closes the ring without reaping
the completion, then ring release will reap and put. If release happens
from that very same task, the completed request task refs will get
put back into the cache pool. This is problematic, as we're now beyond
the point of pruning caches.

Manually drop these caches after doing an IOPOLL reap. This releases
references from the current task, which is enough. If another task
happens to be doing the release, then the caching will not be
triggered and there's no issue.

Cc: stable@vger.kernel.org
Fixes: e98e49b2bb ("io_uring: extend task put optimisations")
Reported-by: Homin Rhee <hominlab@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-23 18:51:08 -06:00
Pavel Begunkov
aa1df3a360 io_uring: fix CQE reordering
Overflowing CQEs may result in reordering, which is buggy in case of
links, F_MORE and so on. If we guarantee that we don't reorder for
the unlikely event of a CQ ring overflow, then we can further extend
this to not have to terminate multishot requests if it happens. For
other operations, like zerocopy sends, we have no choice but to honor
CQE ordering.

Reported-by: Dylan Yudaken <dylany@fb.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ec3bc55687b0768bbe20fb62d7d06cfced7d7e70.1663892031.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-23 15:04:20 -06:00
Jens Axboe
ec7fd2562f io_uring: ensure local task_work marks task as running
io_uring will run task_work from contexts that have been prepared for
waiting, and in doing so it'll implicitly set the task running again
to avoid issues with blocking conditions. The new deferred local
task_work doesn't do that, which can result in spews on this being
an invalid condition:



[  112.917576] do not call blocking ops when !TASK_RUNNING; state=1 set at [<00000000ad64af64>] prepare_to_wait_exclusive+0x3f/0xd0
[  112.983088] WARNING: CPU: 1 PID: 190 at kernel/sched/core.c:9819 __might_sleep+0x5a/0x60
[  112.987240] Modules linked in:
[  112.990504] CPU: 1 PID: 190 Comm: io_uring Not tainted 6.0.0-rc6+ #1617
[  113.053136] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
[  113.133650] RIP: 0010:__might_sleep+0x5a/0x60
[  113.136507] Code: ee 48 89 df 5b 31 d2 5d e9 33 ff ff ff 48 8b 90 30 0b 00 00 48 c7 c7 90 de 45 82 c6 05 20 8b 79 01 01 48 89 d1 e8 3a 49 77 00 <0f> 0b eb d1 66 90 0f 1f 44 00 00 9c 58 f6 c4 02 74 35 65 8b 05 ed
[  113.223940] RSP: 0018:ffffc90000537ca0 EFLAGS: 00010286
[  113.232903] RAX: 0000000000000000 RBX: ffffffff8246782c RCX: ffffffff8270bcc8
IOPS=133.15K, BW=520MiB/s, IOS/call=32/31
[  113.353457] RDX: ffffc90000537b50 RSI: 00000000ffffdfff RDI: 0000000000000001
[  113.358970] RBP: 00000000000003bc R08: 0000000000000000 R09: c0000000ffffdfff
[  113.361746] R10: 0000000000000001 R11: ffffc90000537b48 R12: ffff888103f97280
[  113.424038] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000001
[  113.428009] FS:  00007f67ae7fc700(0000) GS:ffff88842fc80000(0000) knlGS:0000000000000000
[  113.432794] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  113.503186] CR2: 00007f67b8b9b3b0 CR3: 0000000102b9b005 CR4: 0000000000770ee0
[  113.507291] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  113.512669] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  113.574374] PKRU: 55555554
[  113.576800] Call Trace:
[  113.578325]  <TASK>
[  113.579799]  set_page_dirty_lock+0x1b/0x90
[  113.582411]  __bio_release_pages+0x141/0x160
[  113.673078]  ? set_next_entity+0xd7/0x190
[  113.675632]  blk_rq_unmap_user+0xaa/0x210
[  113.678398]  ? timerqueue_del+0x2a/0x40
[  113.679578]  nvme_uring_task_cb+0x94/0xb0
[  113.683025]  __io_run_local_work+0x8a/0x150
[  113.743724]  ? io_cqring_wait+0x33d/0x500
[  113.746091]  io_run_local_work.part.76+0x2e/0x60
[  113.750091]  io_cqring_wait+0x2e7/0x500
[  113.752395]  ? trace_event_raw_event_io_uring_req_failed+0x180/0x180
[  113.823533]  __x64_sys_io_uring_enter+0x131/0x3c0
[  113.827382]  ? switch_fpu_return+0x49/0xc0
[  113.830753]  do_syscall_64+0x34/0x80
[  113.832620]  entry_SYSCALL_64_after_hwframe+0x5e/0xc8

Ensure that we mark current as TASK_RUNNING for deferred task_work
as well.

Fixes: c0e0d6ba25 ("io_uring: add IORING_SETUP_DEFER_TASKRUN")
Reported-by: Stefan Roesch <shr@fb.com>
Reviewed-by: Dylan Yudaken <dylany@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 19:39:35 -06:00
Pavel Begunkov
a47b255e90 io_uring: add custom opcode hooks on fail
Sometimes we have to do a little bit of a fixup on a request failuer in
io_req_complete_failed(). Add a callback in opdef for that.

Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b734cff4e67cb30cca976b9face321023f37549a.1663668091.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 13:15:02 -06:00
Pavel Begunkov
7924fdfeea io_uring: add fast path for io_run_local_work()
We'll grab uring_lock and call __io_run_local_work() with several
atomics inside even if there are no task works. Skip it if ->work_llist
is empty.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/f6a885f372bad2d77d9cd87341b0a86a4000c0ff.1662652536.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 13:15:01 -06:00
Pavel Begunkov
1f8d5bbe98 io_uring/iopoll: unify tw breaking logic
Let's keep checks for whether to break the iopoll loop or not same for
normal and defer tw, this includes ->cached_cq_tail checks guarding
against polling more than asked for.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d2fa8a44f8114f55a4807528da438cde93815360.1662652536.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 13:15:01 -06:00
Pavel Begunkov
9d54bd6a3b io_uring/iopoll: fix unexpected returns
We may propagate a positive return value of io_run_task_work() out of
io_iopoll_check(), which breaks our tests. io_run_task_work() doesn't
return anything useful for us, ignore the return value.

Fixes: c0e0d6ba25 ("io_uring: add IORING_SETUP_DEFER_TASKRUN")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/c442bb87f79cea10b3f857cbd4b9a4f0a0493fa3.1662652536.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 13:14:59 -06:00