commit 860e1c7f8b upstream.
So far io_req_complete_post() only covers DEFER_TASKRUN by completing the
request via task work when the request is completed from io-wq.
However, a uring command can be completed from any context, and if the
ring is set up with DEFER_TASKRUN, the command must be completed from the
submitter task's context, otherwise a wait on IORING_ENTER_GETEVENTS
can't be woken up and may hang forever.
The issue can be observed when removing a ublk device, but it turns out
to be a generic issue for uring commands with DEFER_TASKRUN, so solve it
in the io_uring core code.
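As a point of reference, a minimal userspace sketch (liburing, error
handling trimmed) of the affected configuration: a DEFER_TASKRUN ring
whose owner parks in IORING_ENTER_GETEVENTS via io_uring_wait_cqe(); that
waiter is what could hang when a uring command completed from another
context. Anything beyond the standard liburing calls is illustrative.

#include <liburing.h>

int main(void)
{
        struct io_uring_params p = { 0 };
        struct io_uring ring;
        struct io_uring_cqe *cqe;

        /* DEFER_TASKRUN requires SINGLE_ISSUER */
        p.flags = IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN;
        if (io_uring_queue_init_params(8, &ring, &p) < 0)
                return 1;

        /*
         * ... submit a uring command SQE here (e.g. ublk/nvme passthrough)
         * that the driver completes from a non-submitter context ...
         */

        /* waits via IORING_ENTER_GETEVENTS; this is the wait that could
         * hang if the completion bypassed the submitter's task work */
        if (io_uring_wait_cqe(&ring, &cqe) == 0)
                io_uring_cqe_seen(&ring, cqe);

        io_uring_queue_exit(&ring);
        return 0;
}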
Fixes: e6aeb2721d ("io_uring: complete all requests in task context")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/linux-block/b3fc9991-4c53-9218-a8cc-5b4dd3952108@kernel.dk/
Reported-by: Jens Axboe <axboe@kernel.dk>
Cc: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit b4a72c0589 ]
When removing provided buffers, io_buffer structs are not being disposed
of, leading to a memory leak. They can't be freed individually, because
they are allocated in page-sized groups, so they need to be added to a
free list instead, such as io_buffers_cache. All callers already hold
the lock protecting that list, apart from the buffer-destruction path,
so the lock coverage had to be extended there.
Fixes: cc3cec8367 ("io_uring: speedup provided buffer handling")
Signed-off-by: Wojciech Lukowicz <wlukowicz01@gmail.com>
Link: https://lore.kernel.org/r/20230401195039.404909-2-wlukowicz01@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit c0921e51da ]
When a request to remove buffers is submitted, and the given number to be
removed is larger than what is available in the specified buffer group,
the resulting CQE res will be the number of removed buffers + 1, which is
one more than it should be.
Previously, the head was part of the list and it got removed after the
loop, so the increment was needed. Now, the head is not an element of
the list, so the increment shouldn't be there anymore.
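To illustrate the user-visible contract, a hedged sketch (liburing, error
handling trimmed; the group id and sizes are arbitrary): after providing
NBUFS buffers and asking to remove more than that, cqe->res should report
exactly NBUFS, not NBUFS + 1.

#include <liburing.h>
#include <stdio.h>

#define BGID    7
#define NBUFS   4
#define BUFSZ   256

int main(void)
{
        static char bufs[NBUFS][BUFSZ];
        struct io_uring ring;
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;

        io_uring_queue_init(8, &ring, 0);

        /* provide NBUFS buffers into group BGID */
        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_provide_buffers(sqe, bufs, BUFSZ, NBUFS, BGID, 0);
        io_uring_submit(&ring);
        io_uring_wait_cqe(&ring, &cqe);
        io_uring_cqe_seen(&ring, cqe);

        /* ask to remove more buffers than were provided */
        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_remove_buffers(sqe, NBUFS + 10, BGID);
        io_uring_submit(&ring);
        io_uring_wait_cqe(&ring, &cqe);

        /* expected: NBUFS removed; the bug reported NBUFS + 1 */
        printf("removed %d buffers\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);

        io_uring_queue_exit(&ring);
        return 0;
}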
Fixes: dbc7d452e7 ("io_uring: manage provided buffers strictly ordered")
Signed-off-by: Wojciech Lukowicz <wlukowicz01@gmail.com>
Link: https://lore.kernel.org/r/20230401195039.404909-2-wlukowicz01@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit fd30d1cdcc upstream.
We increase cache->nr_cached when we free into the cache but don't
decrease it when we take from it, so after a while we end up with an
empty cache whose cache->nr_cached is larger than IO_ALLOC_CACHE_MAX.
That makes io_alloc_cache_put() fail and effectively disables caching.
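The invariant is easy to state with a small, self-contained sketch; this
is illustrative C, not the io_uring code itself: a bounded free cache
only stays usable if the counter is decremented on get, which is the step
that was missing.

#include <stddef.h>

#define CACHE_MAX       128

struct cache_entry {
        struct cache_entry *next;
};

struct alloc_cache {
        struct cache_entry *head;
        unsigned int nr_cached;
};

static int cache_put(struct alloc_cache *c, struct cache_entry *e)
{
        if (c->nr_cached >= CACHE_MAX)
                return 0;               /* refused, caller frees for real */
        e->next = c->head;
        c->head = e;
        c->nr_cached++;
        return 1;
}

static struct cache_entry *cache_get(struct alloc_cache *c)
{
        struct cache_entry *e = c->head;

        if (!e)
                return NULL;
        c->head = e->next;
        c->nr_cached--;                 /* the missing step the fix adds */
        return e;
}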
Fixes: 9b797a37c4 ("io_uring: add abstraction around apoll cache")
Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 005308f7bd upstream.
Don't call into io_poll_remove_entries() unless we have at least one
entry queued. Normally this isn't possible, but if we retry poll then
->nr_entries can be cleared again while we're setting it up. If that
happens for a poll retry, we'll still have at least REQ_F_SINGLE_POLL
set, and io_poll_remove_entries() then thinks it has entries to remove.
Clear REQ_F_SINGLE_POLL and REQ_F_DOUBLE_POLL unconditionally when
arming a poll request.
Fixes: c16bda3759 ("io_uring/poll: allow some retries for poll triggering spuriously")
Cc: stable@vger.kernel.org
Reported-by: Pengfei Xu <pengfei.xu@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 74e2e17ee1 upstream.
Since io_uring does nonblocking connect requests, if we do two repeated
ones without having a listener, the second will get -ECONNABORTED rather
than the expected -ECONNREFUSED. Treat -ECONNABORTED like a normal retry
condition when we're nonblocking, provided we haven't already seen it.
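A hedged reproducer sketch of the symptom (liburing, error handling
trimmed, address and port arbitrary): two back-to-back connect requests
against a port with no listener should both complete with -ECONNREFUSED;
before this fix the second could surface -ECONNABORTED instead.

#include <liburing.h>
#include <sys/socket.h>
#include <arpa/inet.h>
#include <stdio.h>
#include <unistd.h>

static int do_connect(struct io_uring *ring, struct sockaddr_in *addr)
{
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        struct io_uring_cqe *cqe;
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        int res;

        io_uring_prep_connect(sqe, fd, (struct sockaddr *)addr, sizeof(*addr));
        io_uring_submit(ring);
        io_uring_wait_cqe(ring, &cqe);
        res = cqe->res;
        io_uring_cqe_seen(ring, cqe);
        close(fd);
        return res;
}

int main(void)
{
        struct sockaddr_in addr = { .sin_family = AF_INET };
        struct io_uring ring;

        addr.sin_port = htons(12345);           /* nothing listening here */
        inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

        io_uring_queue_init(4, &ring, 0);
        /* both attempts should report -ECONNREFUSED */
        printf("first:  %d\n", do_connect(&ring, &addr));
        printf("second: %d\n", do_connect(&ring, &addr));
        io_uring_queue_exit(&ring);
        return 0;
}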
Cc: stable@vger.kernel.org
Fixes: 3fb1bd6881 ("io_uring/net: handle -EINPROGRESS correct for IORING_OP_CONNECT")
Link: https://github.com/axboe/liburing/issues/828
Reported-by: Hui, Chunyang <sanqian.hcy@antgroup.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 9d2789ac9d upstream.
io_uring_cmd_done() currently assumes that the uring_lock is held
when invoked, and while it generally is, this is not guaranteed.
Pass in the issue_flags associated with it, so that we have
IO_URING_F_UNLOCKED available to be able to lock the CQ ring
appropriately when completing events.
Cc: stable@vger.kernel.org
Fixes: ee692a21e9 ("fs,io_uring: add infrastructure for uring-cmd")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 5da28edd7b upstream.
msg_ring requests transferring files support automatic index selection
via IORING_FILE_INDEX_ALLOC; however, they don't return the selected
index to the target ring, and there is no other good way for userspace
to know where the received file ended up.
Return the index for allocated slots and 0 otherwise, which is
consistent with other fixed-file installing requests.
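A hedged sketch of the resulting semantics, assuming liburing's
io_uring_prep_msg_ring_fd_alloc() helper is available (the helper name
and argument order are an assumption about newer liburing, not something
stated here): the CQE posted in the target ring now carries the allocated
fixed-file slot in cqe->res.

#include <liburing.h>
#include <stdio.h>

int main(void)
{
        struct io_uring src, dst;
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;

        io_uring_queue_init(8, &src, 0);
        io_uring_queue_init(8, &dst, 0);
        /* the target ring needs a fixed-file table to allocate a slot from */
        io_uring_register_files_sparse(&dst, 8);

        sqe = io_uring_get_sqe(&src);
        /* send fd 0 to the target ring and let it pick the slot
         * (IORING_FILE_INDEX_ALLOC behind the scenes) */
        io_uring_prep_msg_ring_fd_alloc(sqe, dst.ring_fd, 0, 0x1234, 0);
        io_uring_submit(&src);

        /* with this fix, the target ring's CQE res is the allocated index */
        io_uring_wait_cqe(&dst, &cqe);
        printf("file installed at fixed slot %d\n", cqe->res);
        io_uring_cqe_seen(&dst, cqe);

        io_uring_queue_exit(&src);
        io_uring_queue_exit(&dst);
        return 0;
}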
Cc: stable@vger.kernel.org # v6.0+
Fixes: e6130eba8a ("io_uring: add support for passing fixed file descriptors")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://github.com/axboe/liburing/issues/809
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 03b3d6be73 upstream.
It's possible for a file type to support uring commands, but not
pollable ones. Hence before issuing one of those, we should check
that it is supported and error out upfront if it isn't.
Cc: stable@vger.kernel.org
Fixes: 5756a3a7e7 ("io_uring: add iopoll infrastructure for io_uring_cmd")
Link: https://github.com/axboe/liburing/issues/816
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 48ba08374e ]
Using struct_size() to calculate the size of io_uring_buf_ring will sum
the size of the struct and of the bufs array. However, the struct's fields
are overlaid with the array making the calculated size larger than it
should be.
When registering a ring with N * PAGE_SIZE / sizeof(struct io_uring_buf)
entries, i.e. with fully filled pages, the calculated size will span one
more page than it should and io_uring will try to pin the following page.
Depending on how the application allocated the ring, it might succeed
using an unrelated page or fail returning EFAULT.
The size of the ring should be the product of ring_entries and the size
of io_uring_buf, i.e. the size of the bufs array only.
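To make the sizing concrete, a hedged sketch (liburing, error handling
trimmed, group id arbitrary) of registering a buffer ring whose memory is
sized from the bufs array alone; the header fields occupy the same bytes
as bufs[0], so also adding sizeof(struct io_uring_buf_ring) over-counts
and, for page-filling entry counts, spills into the next page.

#include <liburing.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
        struct io_uring ring;
        struct io_uring_buf_reg reg = { 0 };
        struct io_uring_buf_ring *br;
        /* one fully filled page worth of entries */
        unsigned ring_entries = getpagesize() / sizeof(struct io_uring_buf);
        /* correct size: bufs array only; the header overlays bufs[0] */
        size_t ring_size = ring_entries * sizeof(struct io_uring_buf);
        void *mem;

        io_uring_queue_init(8, &ring, 0);

        posix_memalign(&mem, getpagesize(), ring_size);
        br = mem;
        io_uring_buf_ring_init(br);

        reg.ring_addr = (unsigned long)br;
        reg.ring_entries = ring_entries;
        reg.bgid = 1;
        /* before the fix, the kernel would also try to pin the page that
         * follows this allocation for exactly page-filling entry counts */
        io_uring_register_buf_ring(&ring, &reg, 0);

        io_uring_queue_exit(&ring);
        free(mem);
        return 0;
}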
Fixes: c7fb19428d ("io_uring: add support for ring mapped supplied buffers")
Signed-off-by: Wojciech Lukowicz <wlukowicz01@gmail.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20230218184141.70891-1-wlukowicz01@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 2f2bb1ffc9 upstream.
Just like for task_work, set the task mode to TASK_RUNNING before doing
any potential resume work. We're not holding any locks at this point,
but we may have already set the task state to TASK_INTERRUPTIBLE in
preparation for going to sleep waiting for events. Ensure that we set it
back to TASK_RUNNING if we have work to process, to avoid warnings on
calling blocking operations with !TASK_RUNNING.
Fixes: b5d3ae202f ("io_uring: handle TIF_NOTIFY_RESUME when checking for task_work")
Reported-by: kernel test robot <oliver.sang@intel.com>
Link: https://lore.kernel.org/oe-lkp/202302062208.24d3e563-oliver.sang@intel.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 54aa7f2330 upstream.
Heming reported a BUG when using io_uring to do link-cp on ocfs2. [1]
The following steps can reproduce this BUG:
mount -t ocfs2 /dev/vdc /mnt/ocfs2
cp testfile /mnt/ocfs2/
./link-cp /mnt/ocfs2/testfile /mnt/ocfs2/testfile.1
umount /mnt/ocfs2
Then umount will fail, and it outputs:
umount: /mnt/ocfs2: target is busy.
Tracing umount shows that mnt_get_count() doesn't return the expected
value. After a deeper investigation of the fget()/fput() calls in the
related code flow, I finally found that fget() leaks because ocfs2
doesn't support nowait buffered reads.
io_issue_sqe
|-io_assign_file // do fget() first
|-io_read
|-io_iter_do_read
|-ocfs2_file_read_iter // return -EOPNOTSUPP
|-kiocb_done
|-io_rw_done
|-__io_complete_rw_common // set REQ_F_REISSUE
|-io_resubmit_prep
|-io_req_prep_async // override req->file, leak happens
This was introduced by commit a196c78b54 in v5.18. Fix it by not
re-assigning req->file if it has already been assigned.
[1] https://lore.kernel.org/ocfs2-devel/ab580a75-91c8-d68a-3455-40361be1bfa8@linux.alibaba.com/T/#t
Fixes: a196c78b54 ("io_uring: assign non-fixed early for async work")
Cc: <stable@vger.kernel.org>
Reported-by: Heming Zhao <heming.zhao@suse.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230228045459.13524-1-joseph.qi@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c16bda3759 upstream.
If we get woken spuriously when polling and fail the operation with
-EAGAIN again, then we generally only allow polling again if data
had been transferred at some point. This is indicated with
REQ_F_PARTIAL_IO. However, if the spurious poll triggers when the socket
was originally empty, then we haven't transferred data yet and we will
fail the poll re-arm. This either punts the socket to io-wq if it's
blocking, or it fails the request with -EAGAIN if not. Neither outcome
is desirable, as the former will slow things down while the latter will
confuse the application.
We want to ensure that a repeated poll trigger doesn't lead to infinite
work making no progress; that's what the REQ_F_PARTIAL_IO check was
for. But it doesn't protect against a loop after the first receive, and
it's unnecessarily strict if we started out with an empty socket.
Add a somewhat random retry count, just to put an upper limit on the
potential number of retries that will be done. This should be high enough
that we won't really hit it in practice, unless something needs to be
aborted anyway.
Cc: stable@vger.kernel.org # v5.10+
Link: https://github.com/axboe/liburing/issues/364
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7605c43d67 upstream.
MSG_NOSIGNAL is not applicable to the receiving side; SIGPIPE is
generated when trying to write to a "broken pipe". AF_PACKET's
packet_recvmsg() does enforce this, giving back -EINVAL when MSG_NOSIGNAL
is set, making it unusable in io_uring's recvmsg.
Remove MSG_NOSIGNAL from io_recvmsg_prep().
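For context, a hedged sketch of the call that tripped over this
(liburing; needs CAP_NET_RAW; error handling trimmed): userspace preps a
plain recvmsg with flags == 0 on an AF_PACKET socket, yet before this fix
the kernel-added MSG_NOSIGNAL made packet_recvmsg() return -EINVAL.

#include <liburing.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <linux/if_ether.h>
#include <arpa/inet.h>
#include <stdio.h>

int main(void)
{
        struct io_uring ring;
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;
        char buf[2048];
        struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
        struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };
        int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));

        io_uring_queue_init(4, &ring, 0);

        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_recvmsg(sqe, fd, &msg, 0);        /* no MSG_NOSIGNAL */
        io_uring_submit(&ring);

        io_uring_wait_cqe(&ring, &cqe);
        /* fixed kernels: bytes received; buggy kernels: -EINVAL */
        printf("recvmsg result: %d\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);

        io_uring_queue_exit(&ring);
        return 0;
}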
Cc: stable@vger.kernel.org # v5.10+
Signed-off-by: David Lamparter <equinox@diac24.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20230224150123.128346-1-equinox@diac24.net
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit edd4782696 upstream.
If two or more mappings go back to back to each other, they can be
passed into io_uring to be registered as a single registered buffer.
That would even work if the mappings came from different sources, e.g.
it's possible to mix anon pages with pages from shmem or hugetlb this
way. That is not a problem in itself, but it would be less error prone
if we forbid such mixing.
Cc: <stable@vger.kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f586800854 upstream.
If CONFIG_PREEMPT_NONE is set and the task_work chains are long, we
could be running into issues blocking others for too long. Add a
reschedule check in handle_tw_list(), and flush the ctx if we need to
reschedule.
Cc: stable@vger.kernel.org # 5.10+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b5d3ae202f upstream.
If TIF_NOTIFY_RESUME is set, then we need to call resume_user_mode_work()
for PF_IO_WORKER threads. They never return to usermode, hence never get
a chance to process any items that are marked by this flag. Most notably
this includes the final put of files, but also any throttling markers set
by block cgroups.
Cc: stable@vger.kernel.org # 5.10+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit fbe870a72f ]
fadvise and madvise both provide hints about caching or access patterns
for files and memory respectively. Skip auditing them.
Fixes: 5bd2182d58 ("audit,io_uring,io-wq: add some basic audit support to io_uring")
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Link: https://lore.kernel.org/r/b5dfdcd541115c86dbc774aa9dd502c964849c5f.1675282642.git.rgb@redhat.com
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Drain requests all go through io_drain_req, which has a quick exit in
case there is nothing pending (i.e. the drain is not useful). In that
case it can issue the request immediately.
However for safety it queues it through task work.
The problem is that in this case the request is run asynchronously, but
the async work has not been prepared through io_req_prep_async.
This has not been a problem up to now, as the task work always would run
before returning to userspace, and so the user would not have a chance to
race with it.
However - with IORING_SETUP_DEFER_TASKRUN - this is no longer the case and
the work might be deferred, giving userspace a chance to change data being
referred to in the request.
Instead _always_ prep_async for drain requests, which is simpler anyway
and removes this issue.
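A hedged sketch (liburing, error handling trimmed, file path and data are
placeholders) of the kind of submission affected: a drained request on a
DEFER_TASKRUN ring whose SQE points at transient userspace state, here a
stack iovec. With this change the iovec is copied during async prep
before the request can be deferred, so reusing it after io_uring_submit()
can no longer race with the deferred work.

#include <liburing.h>
#include <fcntl.h>
#include <string.h>
#include <sys/uio.h>

int main(void)
{
        struct io_uring_params p = { 0 };
        struct io_uring ring;
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;
        char data[] = "hello drain\n";
        struct iovec iov = { .iov_base = data, .iov_len = sizeof(data) - 1 };
        int fd = open("/tmp/drain-test", O_WRONLY | O_CREAT | O_TRUNC, 0644);

        p.flags = IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN;
        io_uring_queue_init_params(8, &ring, &p);

        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_writev(sqe, fd, &iov, 1, 0);
        sqe->flags |= IOSQE_IO_DRAIN;   /* goes through io_drain_req() */
        io_uring_submit(&ring);

        /* with the fix, the iovec was already copied at prep time, so
         * clobbering it here cannot change what the deferred request sees */
        memset(&iov, 0, sizeof(iov));

        io_uring_wait_cqe(&ring, &cqe);
        io_uring_cqe_seen(&ring, cqe);
        io_uring_queue_exit(&ring);
        return 0;
}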
Cc: stable@vger.kernel.org
Fixes: c0e0d6ba25 ("io_uring: add IORING_SETUP_DEFER_TASKRUN")
Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20230127105911.2420061-1-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If we're using ring provided buffers with multishot receive, and we end
up doing an io-wq based issue at some point that also needs to select
a buffer, we'll lose the initially assigned buffer group, as
io_ring_buffer_select() correctly clears the buffer group list since the
issue isn't serialized by the ctx uring_lock. This is fine for normal
receives as the request puts the buffer and finishes, but for multishot,
we will re-arm and do further receives. On the next trigger for this
multishot receive, the receive will try and pick from a buffer group
whose value is the same as the buffer ID of the last receive. That is
obviously incorrect, and will result in a premature -ENOBUFS error for
the receive even if we had available buffers in the correct group.
Cache the buffer group value at prep time, so we can restore it for
future receives. This only needs doing for the above mentioned case, but
just do it by default to keep it easier to read.
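A hedged sketch of the affected setup (liburing; the group id is
arbitrary and sockfd is assumed to be a connected socket with buffers
already provided to that group): a multishot receive selecting from a
provided-buffer group, where the buf_group set at prep time must be used
for every subsequent completion, not just the first one.

#include <liburing.h>

#define BGID    3

static void arm_multishot_recv(struct io_uring *ring, int sockfd)
{
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        /* len == 0: the length comes from the selected buffer */
        io_uring_prep_recv_multishot(sqe, sockfd, NULL, 0, 0);
        sqe->flags |= IOSQE_BUFFER_SELECT;
        sqe->buf_group = BGID;          /* must be honoured on every recv */
        io_uring_submit(ring);
}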
Cc: stable@vger.kernel.org
Fixes: b3fdea6ecb ("io_uring: multishot recv")
Fixes: 9bb66906f2 ("io_uring: support multishot in recvmsg")
Cc: Dylan Yudaken <dylany@meta.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A previous commit fixed a poll race that can occur, but it's only
applicable for multishot requests. For a multishot request, we can safely
ignore a spurious wakeup, as we never leave the waitqueue to begin with.
A blunt reissue of a multishot armed request can cause us to leak a
buffer, if the buffers are ring provided. While this seems like a bug in itself,
it's not really defined behavior to reissue a multishot request directly.
It's less efficient to do so as well, and not required to rearm anything
like it is for singleshot poll requests.
Cc: stable@vger.kernel.org
Fixes: 6e5aedb932 ("io_uring/poll: attempt request issue after racy poll wakeup")
Reported-and-tested-by: Olivier Langlois <olivier@trillion01.com>
Link: https://github.com/axboe/liburing/issues/778
Signed-off-by: Jens Axboe <axboe@kernel.dk>
IORING_SETUP_R_DISABLED rings don't have the submitter task set, so
it's not always safe to use ->submitter_task. Disallow posting msg_ring
messages to disabled rings. Also add a task NULL check to account for
the loose synchronisation around testing for IORING_SETUP_R_DISABLED.
Cc: stable@vger.kernel.org
Fixes: 6d043ee116 ("io_uring: do msg_ring in target task via tw")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There are a couple of problems with queueing a tw in io_msg_ring_data()
for remote execution. First, once we queue it, the target ring can
go away and so setting IORING_SQ_TASKRUN there is not safe. Secondly,
userspace might not expect IORING_SQ_TASKRUN.
Extract a helper and uniformly use TWA_SIGNAL without TWA_SIGNAL_NO_IPI
tricks for now, just as it was done in the original patch.
Cc: stable@vger.kernel.org
Fixes: 6d043ee116 ("io_uring: do msg_ring in target task via tw")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If the target ring is configured with IOPOLL, then we always need to hold
the target ring uring_lock before posting CQEs. We could just grab it
unconditionally, but since we don't expect many target rings to be of this
type, make grabbing the uring_lock conditional on the ring type.
Link: https://lore.kernel.org/io-uring/Y8krlYa52%2F0YGqkg@ip-172-31-85-199.ec2.internal/
Reported-by: Xingyuan Mo <hdthky0@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In preparation for needing them somewhere else, move them and get rid of
the unused 'issue_flags' for the unlock side.
No functional changes in this patch.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
syzbot reports an issue with overflow filling for IOPOLL:
WARNING: CPU: 0 PID: 28 at io_uring/io_uring.c:734 io_cqring_event_overflow+0x1c0/0x230 io_uring/io_uring.c:734
CPU: 0 PID: 28 Comm: kworker/u4:1 Not tainted 6.2.0-rc3-syzkaller-16369-g358a161a6a9e #0
Workqueue: events_unbound io_ring_exit_work
Call trace:
io_cqring_event_overflow+0x1c0/0x230 io_uring/io_uring.c:734
io_req_cqe_overflow+0x5c/0x70 io_uring/io_uring.c:773
io_fill_cqe_req io_uring/io_uring.h:168 [inline]
io_do_iopoll+0x474/0x62c io_uring/rw.c:1065
io_iopoll_try_reap_events+0x6c/0x108 io_uring/io_uring.c:1513
io_uring_try_cancel_requests+0x13c/0x258 io_uring/io_uring.c:3056
io_ring_exit_work+0xec/0x390 io_uring/io_uring.c:2869
process_one_work+0x2d8/0x504 kernel/workqueue.c:2289
worker_thread+0x340/0x610 kernel/workqueue.c:2436
kthread+0x12c/0x158 kernel/kthread.c:376
ret_from_fork+0x10/0x20 arch/arm64/kernel/entry.S:863
There is no real problem for normal IOPOLL as flush is also called with
uring_lock taken, but it's getting more complicated for IOPOLL|SQPOLL,
for which __io_cqring_overflow_flush() happens from the CQ waiting path.
Reported-and-tested-by: syzbot+6805087452d72929404e@syzkaller.appspotmail.com
Cc: stable@vger.kernel.org # 5.10+
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If we have multiple requests waiting on the same target poll waitqueue,
then it's quite possible to get a request triggered and get disappointed
in not being able to make any progress with it. If we race in doing so,
we'll potentially leave the poll request on the internal tables, but
removed from the waitqueue. That means that any subsequent trigger of
the poll waitqueue will not kick that request into action, causing an
application to potentially wait for completion of a request that will
never happen.
Fix this by adding a new poll return state, IOU_POLL_REISSUE. Rather
than have complicated logic for how to re-arm a given type of request,
just punt it for a reissue.
While in there, move the 'ret' variable to the only section where it
gets used. This avoids confusion about its scope.
Cc: stable@vger.kernel.org
Fixes: eb0089d629 ("io_uring: single shot poll removal optimisation")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A previous commit split the hash table for polled requests into two
parts, but didn't get the fdinfo output updated. This means that it's
less useful for debugging, as we may think a given request is not pending
poll.
Fix this up by dumping the locked hash table contents too.
Fixes: 9ca9fb24d5 ("io_uring: mutex locked poll hashing")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We have two types of task_work based creation, one is using an existing
worker to setup a new one (eg when going to sleep and we have no free
workers), and the other is allocating a new worker. Only the latter
should be freed when we cancel task_work creation for a new worker.
Fixes: af82425c6a ("io_uring/io-wq: free worker if task_work creation is canceled")
Reported-by: syzbot+d56ec896af3637bdb7e4@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The jiffy to ktime CQ waiting conversion broke how we treat timeouts; in
particular, we rearm it anew every time we get into
io_cqring_wait_schedule() without adjusting the timeout. Waiting for 2
CQEs and getting a task_work in the middle may double the timeout value,
or even worse, in some cases the task may wait indefinitely.
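For reference, the user-visible contract is that of a wait like the
hedged sketch below (liburing): the one-second budget is supposed to
bound the entire wait for two CQEs, not be restarted on every trip
through io_cqring_wait_schedule().

#include <liburing.h>

static int wait_two(struct io_uring *ring)
{
        struct __kernel_timespec ts = { .tv_sec = 1, .tv_nsec = 0 };
        struct io_uring_cqe *cqe;

        /* wait for 2 completions or 1 second total, whichever comes first */
        return io_uring_wait_cqes(ring, &cqe, 2, &ts, NULL);
}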
Cc: stable@vger.kernel.org
Fixes: 228339662b ("io_uring: don't convert to jiffies for waiting on timeouts")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f7bffddd71b08f28a877d44d37ac953ddb01590d.1672915663.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Unlike normal tw, nothing prevents deferred tw from being executed right
after a tw item is added to ->work_llist in io_req_local_work_add(). For
instance, the waiting task may get woken up by CQ posting or a normal
tw. Thus we need to pin the ring for the rest of io_req_local_work_add().
Cc: stable@vger.kernel.org
Fixes: c0e0d6ba25 ("io_uring: add IORING_SETUP_DEFER_TASKRUN")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1a79362b9c10b8523ef70b061d96523650a23344.1672795998.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If we cancel the task_work, the worker will never come into existence.
As this is the last reference to it, ensure that we get it freed
appropriately.
Cc: stable@vger.kernel.org
Reported-by: 진호 <wnwlsgh98@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We only check the register opcode value inside the restricted ring
section, move it into the main io_uring_register() function instead
and check it up front.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If we have a signal pending during cancelations, it'll cause the
task_work run to return an error. Since we didn't run task_work, the
current task is left in TASK_INTERRUPTIBLE state when we need to
re-grab the ctx mutex, and the kernel will rightfully complain about
that.
Move the lock grabbing for the error cases outside the loop to avoid
that issue.
Reported-by: syzbot+7df055631cd1be4586fd@syzkaller.appspotmail.com
Link: https://lore.kernel.org/io-uring/0000000000003a14a905f05050b0@google.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If we have overflow entries being generated after we've done the
initial flush in io_cqring_wait(), then we could be flushing them in the
main wait loop as well. If that's done after having added ourselves
to the cq_wait waitqueue, then the task state can be != TASK_RUNNING
when we enter the overflow flush.
Check for the need to overflow flush, and finish our wait cycle first
if we have to do so.
Reported-and-tested-by: syzbot+cf6ea1d6bb30a4ce10b2@syzkaller.appspotmail.com
Link: https://lore.kernel.org/io-uring/000000000000cb143a05f04eee15@google.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If we're not allocating the vectors because the count is below
UIO_FASTIOV, we still need to properly clear ->free_iov to prevent
an erroneous free of on-stack data.
Reported-by: Jiri Slaby <jirislaby@gmail.com>
Fixes: 4c17a496a7 ("io_uring/net: fix cleanup double free free_iov init")
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It's quite possible that we got woken up because task_work was queued,
and we need to process this task_work to generate the events waited for.
If we return to the wait loop without running task_work, we'll end up
adding the task to the waitqueue again, only to call
io_cqring_wait_schedule() again which will run the task_work. This is
less efficient than it could be, as it requires adding to the cq_wait
queue again. It also triggers the wakeup path for completions as
cq_wait is now non-empty with the task itself, and it'll require another
lock grab and deletion to remove ourselves from the waitqueue.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use task_work_pending() as a better test for whether we have task_work
or not; TIF_NOTIFY_SIGNAL is only valid if any of the task_work items
had been queued with TWA_SIGNAL as the notification mechanism. Hence
task_work_pending() is a more reliable check.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_uring uses call_rcu in the case it needs to signal an eventfd as a
result of an eventfd signal, since recursive eventfd signaling is not
allowed. This should be calling the new call_rcu_hurry API so as not to
delay the signal.
Signed-off-by: Dylan Yudaken <dylany@meta.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Link: https://lore.kernel.org/r/20221215184138.795576-1-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Because the single task locking series got reordered ahead of the
timeout and completion lock changes, two hunks inadvertently ended up
using __io_fill_cqe_req() rather than io_fill_cqe_req(). This meant
that we dropped overflow handling in those two spots. Reinstate the
correct CQE filling helper.
Fixes: f66f73421f ("io_uring: skip spinlocking for ->task_complete")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_kill_timeouts() doesn't post any events but queues everything to
task_work. Locking there is needed to protect the traversal of linked
requests; we should grab completion_lock directly instead of using the
io_cq_[un]lock helpers. The same goes for __io_req_find_next_prep().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/88e75d481a65dc295cb59722bb1cf76402d1c06b.1670002973.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>