Commit Graph

710 Commits

Author SHA1 Message Date
Eric Dumazet 1036908045 net: reclaim skb->scm_io_uring bit
Commit 0091bfc817 ("io_uring/af_unix: defer registered
files gc to io_uring release") added one bit to struct sk_buff.

This structure is critical for networking, and we try very hard
to not add bloat on it, unless absolutely required.

For instance, we can use a specific destructor as a wrapper
around unix_destruct_scm(), to identify skbs that unix_gc()
has to special case.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Cc: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Cc: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-08 13:21:47 +00:00
Linus Torvalds 53ae7e1176 io_uring-6.3-2023-03-03
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmQB58gQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpvz0EACk4hUVpReCU7mspo+eAnzo1+Oi5DHdm34C
 t4DU00lJ107Als5i86kP95zK2UHMXMsTt6N0fVwmDd/qQ8zhMwxH5sjyHVsknvGa
 sNLDCcA8yYwhfnhMnRIk024Rvn9p3U4CThSKF6G7g78absm5ULu/UL3MSiI4Q8aM
 gfK2MlD6aZLTZKLYefOI4fMBwLvJoZ3/mwpzdBfijJT/qUsYhrg8rJ9Pxo+JvtNf
 Oxny83DnT2i7ggVZraNouuD972EGf9Lm4TAmWV5l5nG4aKYZfRny+QwA48Eghs5+
 EpUNtpdvL5pguZIJLI77LSkNmJA/0oFsjh/qGOYcPZfS9erY3KOr8+jBh9FKdLC+
 5AZVt8pnSKqWT1EjKmpEThbaPgzXEhiFYq9kKO4gG5VhSgD5oxC1MXhwb8QJ0gOm
 9PR9pjzwPyYqMj7yAne4lY7I0OujHgYi9nrSpW2mLeeWjRJmkYXjB9NYxNnwiOLS
 H5SzbzHafn8HAuLacwp3StqrwqCE4bx+dTCYgLj/puN2IFEre/xtvy4S6sG2tPAF
 1hJwvV2cYqqK9Khx5fBHv2GAlMlozqe8RwQF84VGTuu+Jturo73CNtlkTGPT2Yds
 SJCTNA05uo0WsvNdiwlMWUW+nXzMFuzu0stb/Gg14KjAcS8CxYGKRRP8kFAoxrQY
 LtmaNYcukQ==
 =qWNf
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-6.3-2023-03-03' of git://git.kernel.dk/linux

Pull more io_uring updates from Jens Axboe:
 "Here's a set of fixes/changes that didn't make the first cut, either
  because they got queued before I sent the early merge request, or
  fixes that came in afterwards. In detail:

   - Don't set MSG_NOSIGNAL on recv/recvmsg opcodes, as AF_PACKET will
     error out (David)

   - Fix for spurious poll wakeups (me)

   - Fix for a file leak for buffered reads in certain conditions
     (Joseph)

   - Don't allow registered buffers of mixed types (Pavel)

   - Improve handling of huge pages for registered buffers (Pavel)

   - Provided buffer ring size calculation fix (Wojciech)

   - Minor cleanups (me)"

* tag 'io_uring-6.3-2023-03-03' of git://git.kernel.dk/linux:
  io_uring/poll: don't pass in wake func to io_init_poll_iocb()
  io_uring: fix fget leak when fs don't support nowait buffered read
  io_uring/poll: allow some retries for poll triggering spuriously
  io_uring: remove MSG_NOSIGNAL from recvmsg
  io_uring/rsrc: always initialize 'folio' to NULL
  io_uring/rsrc: optimise registered huge pages
  io_uring/rsrc: optimise single entry advance
  io_uring/rsrc: disallow multi-source reg buffers
  io_uring: remove unused wq_list_merge
  io_uring: fix size calculation when registering buf ring
  io_uring/rsrc: fix a comment in io_import_fixed()
  io_uring: rename 'in_idle' to 'in_cancel'
  io_uring: consolidate the put_ref-and-return section of adding work
2023-03-03 10:25:29 -08:00
Linus Torvalds f122a08b19 capability: just use a 'u64' instead of a 'u32[2]' array
Back in 2008 we extended the capability bits from 32 to 64, and we did
it by extending the single 32-bit capability word from one word to an
array of two words.  It was then obfuscated by hiding the "2" behind two
macro expansions, with the reasoning being that maybe it gets extended
further some day.

That reasoning may have been valid at the time, but the last thing we
want to do is to extend the capability set any more.  And the array of
values not only causes source code oddities (with loops to deal with
it), but also results in worse code generation.  It's a lose-lose
situation.

So just change the 'u32[2]' into a 'u64' and be done with it.

We still have to deal with the fact that the user space interface is
designed around an array of these 32-bit values, but that was the case
before too, since the array layouts were different (ie user space
doesn't use an array of 32-bit values for individual capability masks,
but an array of 32-bit slices of multiple masks).

So that marshalling of data is actually simplified too, even if it does
remain somewhat obscure and odd.

This was all triggered by my reaction to the new "cap_isidentical()"
introduced recently.  By just using a saner data structure, it went from

	unsigned __capi;
	CAP_FOR_EACH_U32(__capi) {
		if (a.cap[__capi] != b.cap[__capi])
			return false;
	}
	return true;

to just being

	return a.val == b.val;

instead.  Which is rather more obvious both to humans and to compilers.

Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Paul Moore <paul@paul-moore.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-03-01 10:01:22 -08:00
Jens Axboe 1947ddf9b3 io_uring/poll: don't pass in wake func to io_init_poll_iocb()
We only use one, and it's io_poll_wake(). Hardwire that in the initial
init, as well as in __io_queue_proc() if we're setting up for double
poll.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-03-01 10:06:53 -07:00
Joseph Qi 54aa7f2330 io_uring: fix fget leak when fs don't support nowait buffered read
Heming reported a BUG when using io_uring doing link-cp on ocfs2. [1]

Do the following steps can reproduce this BUG:
mount -t ocfs2 /dev/vdc /mnt/ocfs2
cp testfile /mnt/ocfs2/
./link-cp /mnt/ocfs2/testfile /mnt/ocfs2/testfile.1
umount /mnt/ocfs2

Then umount will fail, and it outputs:
umount: /mnt/ocfs2: target is busy.

While tracing umount, it blames mnt_get_count() not return as expected.
Do a deep investigation for fget()/fput() on related code flow, I've
finally found that fget() leaks since ocfs2 doesn't support nowait
buffered read.

io_issue_sqe
|-io_assign_file  // do fget() first
  |-io_read
  |-io_iter_do_read
    |-ocfs2_file_read_iter  // return -EOPNOTSUPP
  |-kiocb_done
    |-io_rw_done
      |-__io_complete_rw_common  // set REQ_F_REISSUE
    |-io_resubmit_prep
      |-io_req_prep_async  // override req->file, leak happens

This was introduced by commit a196c78b54 in v5.18. Fix it by don't
re-assign req->file if it has already been assigned.

[1] https://lore.kernel.org/ocfs2-devel/ab580a75-91c8-d68a-3455-40361be1bfa8@linux.alibaba.com/T/#t

Fixes: a196c78b54 ("io_uring: assign non-fixed early for async work")
Cc: <stable@vger.kernel.org>
Reported-by: Heming Zhao <heming.zhao@suse.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230228045459.13524-1-joseph.qi@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-28 05:58:42 -07:00
Jens Axboe c16bda3759 io_uring/poll: allow some retries for poll triggering spuriously
If we get woken spuriously when polling and fail the operation with
-EAGAIN again, then we generally only allow polling again if data
had been transferred at some point. This is indicated with
REQ_F_PARTIAL_IO. However, if the spurious poll triggers when the socket
was originally empty, then we haven't transferred data yet and we will
fail the poll re-arm. This either punts the socket to io-wq if it's
blocking, or it fails the request with -EAGAIN if not. Neither condition
is desirable, as the former will slow things down, while the latter
will make the application confused.

We want to ensure that a repeated poll trigger doesn't lead to infinite
work making no progress, that's what the REQ_F_PARTIAL_IO check was
for. But it doesn't protect against a loop post the first receive, and
it's unnecessarily strict if we started out with an empty socket.

Add a somewhat random retry count, just to put an upper limit on the
potential number of retries that will be done. This should be high enough
that we won't really hit it in practice, unless something needs to be
aborted anyway.

Cc: stable@vger.kernel.org # v5.10+
Link: https://github.com/axboe/liburing/issues/364
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-25 20:10:13 -07:00
David Lamparter 7605c43d67 io_uring: remove MSG_NOSIGNAL from recvmsg
MSG_NOSIGNAL is not applicable for the receiving side, SIGPIPE is
generated when trying to write to a "broken pipe".  AF_PACKET's
packet_recvmsg() does enforce this, giving back EINVAL when MSG_NOSIGNAL
is set - making it unuseable in io_uring's recvmsg.

Remove MSG_NOSIGNAL from io_recvmsg_prep().

Cc: stable@vger.kernel.org # v5.10+
Signed-off-by: David Lamparter <equinox@diac24.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20230224150123.128346-1-equinox@diac24.net
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-24 12:59:02 -07:00
Jens Axboe 977bc87356 io_uring/rsrc: always initialize 'folio' to NULL
Smatch complains that:

smatch warnings:
io_uring/rsrc.c:1262 io_sqe_buffer_register() error: uninitialized symbol 'folio'.

'folio' may be used uninitialized, which can happen if we end up with a
single page mapped. Ensure that we clear folio to NULL at the top so
it's always set.

Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <error27@gmail.com>
Link: https://lore.kernel.org/r/202302241432.YML1CD5C-lkp@intel.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-24 12:58:31 -07:00
Linus Torvalds 3822a7c409 - Daniel Verkamp has contributed a memfd series ("mm/memfd: add
F_SEAL_EXEC") which permits the setting of the memfd execute bit at
   memfd creation time, with the option of sealing the state of the X bit.
 
 - Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset()
   thread-safe for pmd unshare") which addresses a rare race condition
   related to PMD unsharing.
 
 - Several folioification patch serieses from Matthew Wilcox, Vishal
   Moola, Sidhartha Kumar and Lorenzo Stoakes
 
 - Johannes Weiner has a series ("mm: push down lock_page_memcg()") which
   does perform some memcg maintenance and cleanup work.
 
 - SeongJae Park has added DAMOS filtering to DAMON, with the series
   "mm/damon/core: implement damos filter".  These filters provide users
   with finer-grained control over DAMOS's actions.  SeongJae has also done
   some DAMON cleanup work.
 
 - Kairui Song adds a series ("Clean up and fixes for swap").
 
 - Vernon Yang contributed the series "Clean up and refinement for maple
   tree".
 
 - Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series.  It
   adds to MGLRU an LRU of memcgs, to improve the scalability of global
   reclaim.
 
 - David Hildenbrand has added some userfaultfd cleanup work in the
   series "mm: uffd-wp + change_protection() cleanups".
 
 - Christoph Hellwig has removed the generic_writepages() library
   function in the series "remove generic_writepages".
 
 - Baolin Wang has performed some maintenance on the compaction code in
   his series "Some small improvements for compaction".
 
 - Sidhartha Kumar is doing some maintenance work on struct page in his
   series "Get rid of tail page fields".
 
 - David Hildenbrand contributed some cleanup, bugfixing and
   generalization of pte management and of pte debugging in his series "mm:
   support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with swap
   PTEs".
 
 - Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation
   flag in the series "Discard __GFP_ATOMIC".
 
 - Sergey Senozhatsky has improved zsmalloc's memory utilization with his
   series "zsmalloc: make zspage chain size configurable".
 
 - Joey Gouly has added prctl() support for prohibiting the creation of
   writeable+executable mappings.  The previous BPF-based approach had
   shortcomings.  See "mm: In-kernel support for memory-deny-write-execute
   (MDWE)".
 
 - Waiman Long did some kmemleak cleanup and bugfixing in the series
   "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF".
 
 - T.J.  Alumbaugh has contributed some MGLRU cleanup work in his series
   "mm: multi-gen LRU: improve".
 
 - Jiaqi Yan has provided some enhancements to our memory error
   statistics reporting, mainly by presenting the statistics on a per-node
   basis.  See the series "Introduce per NUMA node memory error
   statistics".
 
 - Mel Gorman has a second and hopefully final shot at fixing a CPU-hog
   regression in compaction via his series "Fix excessive CPU usage during
   compaction".
 
 - Christoph Hellwig does some vmalloc maintenance work in the series
   "cleanup vfree and vunmap".
 
 - Christoph Hellwig has removed block_device_operations.rw_page() in ths
   series "remove ->rw_page".
 
 - We get some maple_tree improvements and cleanups in Liam Howlett's
   series "VMA tree type safety and remove __vma_adjust()".
 
 - Suren Baghdasaryan has done some work on the maintainability of our
   vm_flags handling in the series "introduce vm_flags modifier functions".
 
 - Some pagemap cleanup and generalization work in Mike Rapoport's series
   "mm, arch: add generic implementation of pfn_valid() for FLATMEM" and
   "fixups for generic implementation of pfn_valid()"
 
 - Baoquan He has done some work to make /proc/vmallocinfo and
   /proc/kcore better represent the real state of things in his series
   "mm/vmalloc.c: allow vread() to read out vm_map_ram areas".
 
 - Jason Gunthorpe rationalized the GUP system's interface to the rest of
   the kernel in the series "Simplify the external interface for GUP".
 
 - SeongJae Park wishes to migrate people from DAMON's debugfs interface
   over to its sysfs interface.  To support this, we'll temporarily be
   printing warnings when people use the debugfs interface.  See the series
   "mm/damon: deprecate DAMON debugfs interface".
 
 - Andrey Konovalov provided the accurately named "lib/stackdepot: fixes
   and clean-ups" series.
 
 - Huang Ying has provided a dramatic reduction in migration's TLB flush
   IPI rates with the series "migrate_pages(): batch TLB flushing".
 
 - Arnd Bergmann has some objtool fixups in "objtool warning fixes".
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY/PoPQAKCRDdBJ7gKXxA
 jlvpAPsFECUBBl20qSue2zCYWnHC7Yk4q9ytTkPB/MMDrFEN9wD/SNKEm2UoK6/K
 DmxHkn0LAitGgJRS/W9w81yrgig9tAQ=
 =MlGs
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - Daniel Verkamp has contributed a memfd series ("mm/memfd: add
   F_SEAL_EXEC") which permits the setting of the memfd execute bit at
   memfd creation time, with the option of sealing the state of the X
   bit.

 - Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset()
   thread-safe for pmd unshare") which addresses a rare race condition
   related to PMD unsharing.

 - Several folioification patch serieses from Matthew Wilcox, Vishal
   Moola, Sidhartha Kumar and Lorenzo Stoakes

 - Johannes Weiner has a series ("mm: push down lock_page_memcg()")
   which does perform some memcg maintenance and cleanup work.

 - SeongJae Park has added DAMOS filtering to DAMON, with the series
   "mm/damon/core: implement damos filter".

   These filters provide users with finer-grained control over DAMOS's
   actions. SeongJae has also done some DAMON cleanup work.

 - Kairui Song adds a series ("Clean up and fixes for swap").

 - Vernon Yang contributed the series "Clean up and refinement for maple
   tree".

 - Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It
   adds to MGLRU an LRU of memcgs, to improve the scalability of global
   reclaim.

 - David Hildenbrand has added some userfaultfd cleanup work in the
   series "mm: uffd-wp + change_protection() cleanups".

 - Christoph Hellwig has removed the generic_writepages() library
   function in the series "remove generic_writepages".

 - Baolin Wang has performed some maintenance on the compaction code in
   his series "Some small improvements for compaction".

 - Sidhartha Kumar is doing some maintenance work on struct page in his
   series "Get rid of tail page fields".

 - David Hildenbrand contributed some cleanup, bugfixing and
   generalization of pte management and of pte debugging in his series
   "mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with
   swap PTEs".

 - Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation
   flag in the series "Discard __GFP_ATOMIC".

 - Sergey Senozhatsky has improved zsmalloc's memory utilization with
   his series "zsmalloc: make zspage chain size configurable".

 - Joey Gouly has added prctl() support for prohibiting the creation of
   writeable+executable mappings.

   The previous BPF-based approach had shortcomings. See "mm: In-kernel
   support for memory-deny-write-execute (MDWE)".

 - Waiman Long did some kmemleak cleanup and bugfixing in the series
   "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF".

 - T.J. Alumbaugh has contributed some MGLRU cleanup work in his series
   "mm: multi-gen LRU: improve".

 - Jiaqi Yan has provided some enhancements to our memory error
   statistics reporting, mainly by presenting the statistics on a
   per-node basis. See the series "Introduce per NUMA node memory error
   statistics".

 - Mel Gorman has a second and hopefully final shot at fixing a CPU-hog
   regression in compaction via his series "Fix excessive CPU usage
   during compaction".

 - Christoph Hellwig does some vmalloc maintenance work in the series
   "cleanup vfree and vunmap".

 - Christoph Hellwig has removed block_device_operations.rw_page() in
   ths series "remove ->rw_page".

 - We get some maple_tree improvements and cleanups in Liam Howlett's
   series "VMA tree type safety and remove __vma_adjust()".

 - Suren Baghdasaryan has done some work on the maintainability of our
   vm_flags handling in the series "introduce vm_flags modifier
   functions".

 - Some pagemap cleanup and generalization work in Mike Rapoport's
   series "mm, arch: add generic implementation of pfn_valid() for
   FLATMEM" and "fixups for generic implementation of pfn_valid()"

 - Baoquan He has done some work to make /proc/vmallocinfo and
   /proc/kcore better represent the real state of things in his series
   "mm/vmalloc.c: allow vread() to read out vm_map_ram areas".

 - Jason Gunthorpe rationalized the GUP system's interface to the rest
   of the kernel in the series "Simplify the external interface for
   GUP".

 - SeongJae Park wishes to migrate people from DAMON's debugfs interface
   over to its sysfs interface. To support this, we'll temporarily be
   printing warnings when people use the debugfs interface. See the
   series "mm/damon: deprecate DAMON debugfs interface".

 - Andrey Konovalov provided the accurately named "lib/stackdepot: fixes
   and clean-ups" series.

 - Huang Ying has provided a dramatic reduction in migration's TLB flush
   IPI rates with the series "migrate_pages(): batch TLB flushing".

 - Arnd Bergmann has some objtool fixups in "objtool warning fixes".

* tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (505 commits)
  include/linux/migrate.h: remove unneeded externs
  mm/memory_hotplug: cleanup return value handing in do_migrate_range()
  mm/uffd: fix comment in handling pte markers
  mm: change to return bool for isolate_movable_page()
  mm: hugetlb: change to return bool for isolate_hugetlb()
  mm: change to return bool for isolate_lru_page()
  mm: change to return bool for folio_isolate_lru()
  objtool: add UACCESS exceptions for __tsan_volatile_read/write
  kmsan: disable ftrace in kmsan core code
  kasan: mark addr_has_metadata __always_inline
  mm: memcontrol: rename memcg_kmem_enabled()
  sh: initialize max_mapnr
  m68k/nommu: add missing definition of ARCH_PFN_OFFSET
  mm: percpu: fix incorrect size in pcpu_obj_full_size()
  maple_tree: reduce stack usage with gcc-9 and earlier
  mm: page_alloc: call panic() when memoryless node allocation fails
  mm: multi-gen LRU: avoid futile retries
  migrate_pages: move THP/hugetlb migration support check to simplify code
  migrate_pages: batch flushing TLB
  migrate_pages: share more code between _unmap and _move
  ...
2023-02-23 17:09:35 -08:00
Pavel Begunkov 57bebf807e io_uring/rsrc: optimise registered huge pages
When registering huge pages, internally io_uring will split them into
many PAGE_SIZE bvec entries. That's bad for performance as drivers need
to eventually dma-map the data and will do it individually for each bvec
entry. Coalesce huge pages into one large bvec.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-22 09:57:24 -07:00
Pavel Begunkov b000ae0ec2 io_uring/rsrc: optimise single entry advance
Iterating within the first bvec entry should be essentially free, but we
use iov_iter_advance() for that, which shows up in benchmark profiles
taking up to 0.5% of CPU. Replace it with a hand coded version.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-22 09:57:23 -07:00
Pavel Begunkov edd4782696 io_uring/rsrc: disallow multi-source reg buffers
If two or more mappings go back to back to each other they can be passed
into io_uring to be registered as a single registered buffer. That would
even work if mappings came from different sources, e.g. it's possible to
mix in this way anon pages and pages from shmem or hugetlb. That is not
a problem but it'd rather be less prone if we forbid such mixing.

Cc: <stable@vger.kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-22 09:57:23 -07:00
Pavel Begunkov 9a1563d172 io_uring: remove unused wq_list_merge
There are no users of wq_list_merge, kill it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/5f9ad0301949213230ad9000a8359d591aae615a.1677002255.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-22 09:57:23 -07:00
Wojciech Lukowicz 48ba08374e io_uring: fix size calculation when registering buf ring
Using struct_size() to calculate the size of io_uring_buf_ring will sum
the size of the struct and of the bufs array. However, the struct's fields
are overlaid with the array making the calculated size larger than it
should be.

When registering a ring with N * PAGE_SIZE / sizeof(struct io_uring_buf)
entries, i.e. with fully filled pages, the calculated size will span one
more page than it should and io_uring will try to pin the following page.
Depending on how the application allocated the ring, it might succeed
using an unrelated page or fail returning EFAULT.

The size of the ring should be the product of ring_entries and the size
of io_uring_buf, i.e. the size of the bufs array only.

Fixes: c7fb19428d ("io_uring: add support for ring mapped supplied buffers")
Signed-off-by: Wojciech Lukowicz <wlukowicz01@gmail.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20230218184141.70891-1-wlukowicz01@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-22 09:57:23 -07:00
Pavel Begunkov 6bf65a1b36 io_uring/rsrc: fix a comment in io_import_fixed()
io_import_fixed() supports offsets, but "may not" means the opposite.
Replace it with "might not" so the comments rather speaks about
possible cases.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/5b5f79958456caa6dc532f6205f75f224b232c81.1676902343.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-22 09:57:23 -07:00
Jens Axboe 8d664282a0 io_uring: rename 'in_idle' to 'in_cancel'
This better describes what it does - it's incremented when the task is
currently undergoing a cancelation operation, due to exiting or exec'ing.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-22 09:57:23 -07:00
Jens Axboe ce8e04f6e5 io_uring: consolidate the put_ref-and-return section of adding work
We've got a few cases of this, move them to one section and just use
gotos to get there. Reduces the text section on both arm64 and x86-64,
using gcc-12.2.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-22 09:57:23 -07:00
Linus Torvalds 5b0ed59649 for-6.3/block-2023-02-16
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmPvfncQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpob2EADXJxcr2jjYHm/7cjKkyuVX8fr80dNdMeuY
 JFdsjG1k6Uj73BVhQQWYTcs/PsrWBHWRsv6uz4WgOELj55eXmf5Q0kJszyUeJW33
 /DjqLvtoppVcYf80xE13wKvCfn73BjwQo6xkGM0qAYn15eaXiD/Ax3xC6eJlsBeK
 PEw7EJyhacbSxZa/1D2B6+mqII1jUQWProTCc3udZ4JHi3WvdWa3Rda0qCqHl4a1
 +K2aP2YTFIRPxBzfMNa/CafWVIFubTdht+4Ds6R60RImzB9e0VUBfcsiUyW5Zg7L
 Fwv7ptXuWrALwVNdW56Oz1QikBxn2pdRR2HMLwKJW1MD8kP9r8LMm2jV5Rhiwe0B
 OQsGRYkOzBvw+bxeP5fvk0iPGVMz6ActH4gkraA5QdLqayDaFYOadlhqz0uRo5SH
 Fb42Vl658K/MHDSIk8U58TNkmrsIJsBGohXI9DOGINPPvv3XOPi4Q1HmXkGRmii0
 y+lNU/QEGh7xXXew29SPP76uQpQaYfC7NxXCMw/OpOMwehzjsjshmM2lpxi8zsgt
 PJUmfHv5qxCplNmTJXmUpmX7sS7550HUdu9FJb13DM+gzKg8bk9jWVuLrzqrVlG5
 1hKWEl1+heg1heRfaIuJVLbPI0au6Sb4uqhih/PHyrP9TWIoAruDbDJM65GKTxyE
 2uEgcHzHQw==
 =poRc
 -----END PGP SIGNATURE-----

Merge tag 'for-6.3/block-2023-02-16' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

 - NVMe updates via Christoph:
      - Small improvements to the logging functionality (Amit Engel)
      - Authentication cleanups (Hannes Reinecke)
      - Cleanup and optimize the DMA mapping cod in the PCIe driver
        (Keith Busch)
      - Work around the command effects for Format NVM (Keith Busch)
      - Misc cleanups (Keith Busch, Christoph Hellwig)
      - Fix and cleanup freeing single sgl (Keith Busch)

 - MD updates via Song:
      - Fix a rare crash during the takeover process
      - Don't update recovery_cp when curr_resync is ACTIVE
      - Free writes_pending in md_stop
      - Change active_io to percpu

 - Updates to drbd, inching us closer to unifying the out-of-tree driver
   with the in-tree one (Andreas, Christoph, Lars, Robert)

 - BFQ update adding support for multi-actuator drives (Paolo, Federico,
   Davide)

 - Make brd compliant with REQ_NOWAIT (me)

 - Fix for IOPOLL and queue entering, fixing stalled IO waiting on
   timeouts (me)

 - Fix for REQ_NOWAIT with multiple bios (me)

 - Fix memory leak in blktrace cleanup (Greg)

 - Clean up sbitmap and fix a potential hang (Kemeng)

 - Clean up some bits in BFQ, and fix a bug in the request injection
   (Kemeng)

 - Clean up the request allocation and issue code, and fix some bugs
   related to that (Kemeng)

 - ublk updates and fixes:
      - Add support for unprivileged ublk (Ming)
      - Improve device deletion handling (Ming)
      - Misc (Liu, Ziyang)

 - s390 dasd fixes (Alexander, Qiheng)

 - Improve utility of request caching and fixes (Anuj, Xiao)

 - zoned cleanups (Pankaj)

 - More constification for kobjs (Thomas)

 - blk-iocost cleanups (Yu)

 - Remove bio splitting from drivers that don't need it (Christoph)

 - Switch blk-cgroups to use struct gendisk. Some of this is now
   incomplete as select late reverts were done. (Christoph)

 - Add bvec initialization helpers, and convert callers to use that
   rather than open-coding it (Christoph)

 - Misc fixes and cleanups (Jinke, Keith, Arnd, Bart, Li, Martin,
   Matthew, Ulf, Zhong)

* tag 'for-6.3/block-2023-02-16' of git://git.kernel.dk/linux: (169 commits)
  brd: use radix_tree_maybe_preload instead of radix_tree_preload
  block: use proper return value from bio_failfast()
  block: bio-integrity: Copy flags when bio_integrity_payload is cloned
  block: Fix io statistics for cgroup in throttle path
  brd: mark as nowait compatible
  brd: check for REQ_NOWAIT and set correct page allocation mask
  brd: return 0/-error from brd_insert_page()
  block: sync mixed merged request's failfast with 1st bio's
  Revert "blk-cgroup: pin the gendisk in struct blkcg_gq"
  Revert "blk-cgroup: pass a gendisk to blkg_lookup"
  Revert "blk-cgroup: delay blk-cgroup initialization until add_disk"
  Revert "blk-cgroup: delay calling blkcg_exit_disk until disk_release"
  Revert "blk-cgroup: move the cgroup information to struct gendisk"
  nvme-pci: remove iod use_sgls
  nvme-pci: fix freeing single sgl
  block: ublk: check IO buffer based on flag need_get_data
  s390/dasd: Fix potential memleak in dasd_eckd_init()
  s390/dasd: sort out physical vs virtual pointers usage
  block: Remove the ALLOC_CACHE_SLACK constant
  block: make kobj_type structures constant
  ...
2023-02-20 14:27:21 -08:00
Linus Torvalds c1ef500307 for-6.3/iter-ubuf-2023-02-16
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmPueOUQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpkEWD/9hOagNSeXfCd1eAJ44E5IemgHKqfU0RXRs
 kdW1o35eBXwPVAyhhDmcz60hkijm47Pw3IJUdSNaGqdm9uYpLwiatuYY5EOVC4qg
 BFkVPGCA8ERXStFM/mnWj0gkYDmb/8bzk9bdBU1FQvQOIQgYpomlHdMVfQJ+0tDT
 7VTffRaWfcxWd1u+NBMDxmfz47teplxiHJDg38wGlgT6G1kMdEUK+y6hd0SoASPM
 ocMW8LL2v3wLQhQAOWYd6sw2kFnxx4VOzhSepPAY0U78CR6CYm6zthRd+k+Ro/nt
 RFKL6Ijt2LRaOZqY3HRnCpUwmhBNft0ZFH4OHh21vPaukB4sjWbQ5SJniucNcoCN
 rb9jAJDJdS6oy+Uimeig99aQ/yGSLJXG8MQKrC36NdGSwydUfaCLaoLKwfC8zYDC
 Zr3G7tfOhSJQzQtNSH1H0SqHFvMfc7C2Ra8mYXdHbcREswKOTT73aJUHq5RFfwO+
 m10V5rQgCB9rJz0NLbo68GhxDrbTQuueDj+yDWCSoulUdNg3s2BZ3/iBjODJyJNO
 P3aG4bMYxC5te2JWCBnmR6du//8vnvDHnwWh9yKcUk+l/9OTtAPouAdUCv+r1wkz
 Ib0aEX3SiJ65LIePQO2kbdvgnweyFCJYduvMW9zjsH9GMgRP0eA6EKZh3mbKhOw4
 yw9BcZoNYQ==
 =+ImB
 -----END PGP SIGNATURE-----

Merge tag 'for-6.3/iter-ubuf-2023-02-16' of git://git.kernel.dk/linux

Pull io_uring ITER_UBUF conversion from Jens Axboe:
 "Since we now have ITER_UBUF available, switch to using it for single
  ranges as it's more efficient than ITER_IOVEC for that"

* tag 'for-6.3/iter-ubuf-2023-02-16' of git://git.kernel.dk/linux:
  block: use iter_ubuf for single range
  iov_iter: move iter_ubuf check inside restore WARN
  io_uring: use iter_ubuf for single range imports
  io_uring: switch network send/recv to ITER_UBUF
  iov: add import_ubuf()
2023-02-20 14:03:57 -08:00
Josh Triplett 7d3fd88d61 io_uring: Support calling io_uring_register with a registered ring fd
Add a new flag IORING_REGISTER_USE_REGISTERED_RING (set via the high bit
of the opcode) to treat the fd as a registered index rather than a file
descriptor.

This makes it possible for a library to open an io_uring, register the
ring fd, close the ring fd, and subsequently use the ring entirely via
registered index.

Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Link: https://lore.kernel.org/r/f2396369e638284586b069dbddffb8c992afba95.1676419314.git.josh@joshtriplett.org
[axboe: remove extra high bit clear]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-16 06:09:30 -07:00
Richard Guy Briggs fbe870a72f io_uring,audit: don't log IORING_OP_MADVISE
fadvise and madvise both provide hints for caching or access pattern for
file and memory respectively.  Skip them.

Fixes: 5bd2182d58 ("audit,io_uring,io-wq: add some basic audit support to io_uring")
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Link: https://lore.kernel.org/r/b5dfdcd541115c86dbc774aa9dd502c964849c5f.1675282642.git.rgb@redhat.com
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-10 16:00:30 -07:00
Jens Axboe 2f2bb1ffc9 io_uring: mark task TASK_RUNNING before handling resume/task work
Just like for task_work, set the task mode to TASK_RUNNING before doing
any potential resume work. We're not holding any locks at this point,
but we may have already set the task state to TASK_INTERRUPTIBLE in
preparation for going to sleep waiting for events. Ensure that we set it
back to TASK_RUNNING if we have work to process, to avoid warnings on
calling blocking operations with !TASK_RUNNING.

Fixes: b5d3ae202f ("io_uring: handle TIF_NOTIFY_RESUME when checking for task_work")
Reported-by: kernel test robot <oliver.sang@intel.com>
Link: https://lore.kernel.org/oe-lkp/202302062208.24d3e563-oliver.sang@intel.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-06 08:23:21 -07:00
Christoph Hellwig cc342a2193 io_uring: use bvec_set_page to initialize a bvec
Use the bvec_set_page helper to initialize a bvec.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20230203150634.3199647-19-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-03 10:17:42 -07:00
Dylan Yudaken 0ffae640ad io_uring: always go async for unsupported open flags
No point in issuing -> return -EAGAIN -> go async, when it can be done upfront.

Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20230127135227.3646353-5-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:18:26 -07:00
Dylan Yudaken c31cc60fdd io_uring: always go async for unsupported fadvise flags
No point in issuing -> return -EAGAIN -> go async, when it can be done upfront.

Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20230127135227.3646353-4-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:18:26 -07:00
Dylan Yudaken aebb224fd4 io_uring: for requests that require async, force it
Some requests require being run async as they do not support
non-blocking. Instead of trying to issue these requests, getting -EAGAIN
and then queueing them for async issue, rather just force async upfront.

Add WARN_ON_ONCE to make sure surprising code paths do not come up,
however in those cases the bug would end up being a blocking
io_uring_enter(2) which should not be critical.

Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20230127135227.3646353-3-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:18:26 -07:00
Dylan Yudaken 6bb3085556 io_uring: if a linked request has REQ_F_FORCE_ASYNC then run it async
REQ_F_FORCE_ASYNC was being ignored for re-queueing linked
requests. Instead obey that flag.

Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20230127135227.3646353-2-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:18:26 -07:00
Jens Axboe f586800854 io_uring: add reschedule point to handle_tw_list()
If CONFIG_PREEMPT_NONE is set and the task_work chains are long, we
could be running into issues blocking others for too long. Add a
reschedule check in handle_tw_list(), and flush the ctx if we need to
reschedule.

Cc: stable@vger.kernel.org # 5.10+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:41 -07:00
Jens Axboe fcc926bb85 io_uring: add a conditional reschedule to the IOPOLL cancelation loop
If the kernel is configured with CONFIG_PREEMPT_NONE, we could be
sitting in a tight loop reaping events but not giving them a chance to
finish. This results in a trace ala:

rcu: INFO: rcu_sched self-detected stall on CPU
rcu: 	2-...!: (5249 ticks this GP) idle=935c/1/0x4000000000000000 softirq=4265/4274 fqs=1
	(t=5251 jiffies g=465 q=4135 ncpus=4)
rcu: rcu_sched kthread starved for 5249 jiffies! g465 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
rcu: 	Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
rcu: RCU grace-period kthread stack dump:
task:rcu_sched       state:R  running task     stack:0     pid:12    ppid:2      flags:0x00000008
Call trace:
 __switch_to+0xb0/0xc8
 __schedule+0x43c/0x520
 schedule+0x4c/0x98
 schedule_timeout+0xbc/0xdc
 rcu_gp_fqs_loop+0x308/0x344
 rcu_gp_kthread+0xd8/0xf0
 kthread+0xb8/0xc8
 ret_from_fork+0x10/0x20
rcu: Stack dump where RCU GP kthread last ran:
Task dump for CPU 0:
task:kworker/u8:10   state:R  running task     stack:0     pid:89    ppid:2      flags:0x0000000a
Workqueue: events_unbound io_ring_exit_work
Call trace:
 __switch_to+0xb0/0xc8
 0xffff0000c8fefd28
CPU: 2 PID: 95 Comm: kworker/u8:13 Not tainted 6.2.0-rc5-00042-g40316e337c80-dirty #2759
Hardware name: linux,dummy-virt (DT)
Workqueue: events_unbound io_ring_exit_work
pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
pc : io_do_iopoll+0x344/0x360
lr : io_do_iopoll+0xb8/0x360
sp : ffff800009bebc60
x29: ffff800009bebc60 x28: 0000000000000000 x27: 0000000000000000
x26: ffff0000c0f67d48 x25: ffff0000c0f67840 x24: ffff800008950024
x23: 0000000000000001 x22: 0000000000000000 x21: ffff0000c27d3200
x20: ffff0000c0f67840 x19: ffff0000c0f67800 x18: 0000000000000000
x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
x14: 0000000000000001 x13: 0000000000000001 x12: 0000000000000000
x11: 0000000000000179 x10: 0000000000000870 x9 : ffff800009bebd60
x8 : ffff0000c27d3ad0 x7 : fefefefefefefeff x6 : 0000646e756f626e
x5 : ffff0000c0f67840 x4 : 0000000000000000 x3 : ffff0000c2398000
x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000000
Call trace:
 io_do_iopoll+0x344/0x360
 io_uring_try_cancel_requests+0x21c/0x334
 io_ring_exit_work+0x90/0x40c
 process_one_work+0x1a4/0x254
 worker_thread+0x1ec/0x258
 kthread+0xb8/0xc8
 ret_from_fork+0x10/0x20

Add a cond_resched() in the cancelation IOPOLL loop to fix this.

Cc: stable@vger.kernel.org # 5.10+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:41 -07:00
Pavel Begunkov 50470fc572 io_uring: return normal tw run linking optimisation
io_submit_flush_completions() may produce new task_work items, so it's a
good idea to recheck the task_work list after flushing completions. The
optimisation is not new and was accidentially removed by
f88262e60b ("io_uring: lockless task list")

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a7ed5ede84de190832cc33ebbcdd6e91cd90f5b6.1674484266.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:41 -07:00
Pavel Begunkov cb6bf7f285 io_uring: refactor tctx_task_work
Merge almost identical sections of tctx_task_work(), this will make code
modifications later easier and also inlines handle_tw_list().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d06592d91e3e7559e7a4dbb8907d110863008dc7.1674484266.git.asml.silence@gmail.com
[axboe: fold in setting count to zero patch from Tom Rix]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:41 -07:00
Pavel Begunkov 5afa465071 io_uring: refactor io_put_task helpers
Add a helper for putting refs from the target task context, rename
__io_put_task() and add a couple of comments around. Use the remote
version for __io_req_complete_post(), the local is only needed for
__io_submit_flush_completions().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3bf92ebd594769d8a5d648472a8e335f2031d542.1674484266.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:41 -07:00
Pavel Begunkov c8576f3e61 io_uring: refactor req allocation
Follow the io_get_sqe pattern returning the result via a pointer
and hide request cache refill inside io_alloc_req().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8c37c2e8a3cb5e4cd6a8ae3b91371227a92708a6.1674484266.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:41 -07:00
Pavel Begunkov b5083dfa36 io_uring: improve io_get_sqe
Return an SQE from io_get_sqe() as a parameter and use the return value
to determine if it failed or not. This enables the compiler to compile out
the sqe NULL check when we know that the return SQE is valid.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9cceb11329240ea097dffef6bf0a675bca14cf42.1674484266.git.asml.silence@gmail.com
[axboe: remove bogus const modifier on return value]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:41 -07:00
Pavel Begunkov b2aa66aff6 io_uring: kill outdated comment about overflow flush
__io_cqring_overflow_flush() doesn't return anything anymore, remove
outdate comment.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4ce2bcbb17eac80cdf883fd1459d5ee6586e238c.1674484266.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:41 -07:00
Pavel Begunkov c10bb64684 io_uring: use user visible tail in io_uring_poll()
We return POLLIN from io_uring_poll() depending on whether there are
CQEs for the userspace, and so we should use the user visible tail
pointer instead of a transient cached value.

Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/228ffcbf30ba98856f66ffdb9a6a60ead1dd96c0.1674484266.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:41 -07:00
Jens Axboe f499254474 io_uring: pass in io_issue_def to io_assign_file()
This generates better code for me, avoiding an extra load on arm64, and
both call sites already have this variable available for easy passing.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:41 -07:00
Breno Leitao c1755c25a7 io_uring: Enable KASAN for request cache
Every io_uring request is represented by struct io_kiocb, which is
cached locally by io_uring (not SLAB/SLUB) in the list called
submit_state.freelist. This patch simply enabled KASAN for this free
list.

This list is initially created by KMEM_CACHE, but later, managed by
io_uring. This patch basically poisons the objects that are not used
(i.e., they are the free list), and unpoisons it when the object is
allocated/removed from the list.

Touching these poisoned objects while in the freelist will cause a KASAN
warning.

Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:41 -07:00
Jens Axboe b5d3ae202f io_uring: handle TIF_NOTIFY_RESUME when checking for task_work
If TIF_NOTIFY_RESUME is set, then we need to call resume_user_mode_work()
for PF_IO_WORKER threads. They never return to usermode, hence never get
a chance to process any items that are marked by this flag. Most notably
this includes the final put of files, but also any throttling markers set
by block cgroups.

Cc: stable@vger.kernel.org # 5.10+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:41 -07:00
Jens Axboe 8572df941c io_uring/msg-ring: ensure flags passing works for task_work completions
If the target ring is using IORING_SETUP_SINGLE_ISSUER and we're posting
a message from a different thread, then we need to ensure that the
fallback task_work that posts the CQE knwos about the flags passing as
well. If not we'll always be posting 0 as the flags.

Fixes: 3563d7ed58a5 ("io_uring/msg_ring: Pass custom flags to the cqe")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:41 -07:00
Breno Leitao f30bd4d038 io_uring: Split io_issue_def struct
This patch removes some "cold" fields from `struct io_issue_def`.

The plan is to keep only highly used fields into `struct io_issue_def`, so,
it may be hot in the cache. The hot fields are basically all the bitfields
and the callback functions for .issue and .prep.

The other less frequently used fields are now located in a secondary and
cold struct, called `io_cold_def`.

This is the size for the structs:

Before: io_issue_def = 56 bytes
After: io_issue_def = 24 bytes; io_cold_def = 40 bytes

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20230112144411.2624698-2-leitao@debian.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:41 -07:00
Breno Leitao a7dd27828b io_uring: Rename struct io_op_def
The current io_op_def struct is becoming huge and the name is a bit
generic.

The goal of this patch is to rename this struct to `io_issue_def`. This
struct will contain the hot functions associated with the issue code
path.

For now, this patch only renames the structure, and an upcoming patch
will break up the structure in two, moving the non-issue fields to a
secondary struct.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20230112144411.2624698-1-leitao@debian.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:41 -07:00
Pavel Begunkov 68a2cc1bba io_uring: refactor __io_req_complete_post
Keep parts of __io_req_complete_post() relying on req->flags together so
the value can be cached.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2b4fbb42f404a0e75c4d9f0a5b16f314a839d0a9.1673887636.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:41 -07:00
Pavel Begunkov 31f084b7b0 io_uring: simplify fallback execution
Lock the ring with uring_lock in io_fallback_req_func(), which should
make it a bit safer and easier. With that we also don't need refs
pinning as io_ring_exit_work() will wait until uring_lock is freed.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/56170e6a0cbfc8edee2794c6613e8f6f1d76d276.1673887636.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:41 -07:00
Pavel Begunkov 89800a2dd5 io_uring: don't export io_put_task()
io_put_task() is only used in uring.c so enclose it there together with
__io_put_task().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/43c7f9227e2ab215f1a6069dadbc5382bed346fe.1673887636.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:41 -07:00
Pavel Begunkov b0b7a7d24b io_uring: return back links tw run optimisation
io_submit_flush_completions() may queue new requests for tw execution,
especially true for linked requests. Recheck the tw list for emptiness
after flushing completions.

Note that this doesn't really fix the commit referenced below, but it
does reinstate an optimization that existed before that got merged.

Fixes: f88262e60b ("io_uring: lockless task list")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6328acdbb5e60efc762b18003382de077e6e1367.1673887636.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:41 -07:00
Quanfa Fu 88b80534f6 io_uring: make io_sqpoll_wait_sq return void
Change the return type to void since it always return 0, and no need
to do the checking in syscall io_uring_enter.

Signed-off-by: Quanfa Fu <quanfafu@gmail.com>
Link: https://lore.kernel.org/r/20230115071519.554282-1-quanfafu@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:40 -07:00
Pavel Begunkov c3f4d39ee4 io_uring: optimise deferred tw execution
We needed fake nodes in __io_run_local_work() and to avoid unecessary wake
ups while the task already running task_works, but we don't need them
anymore since wake ups are protected by cq_waiting, which is always
cleared by the time we're executing deferred task_work items.

Note that because of loose sync around cq_waiting clearing
io_req_local_work_add() may wake the task more than once, but that's
fine and should be rare to not hurt perf.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8839534891f0a2f1076e78554a31ea7e099f7de5.1673274244.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:40 -07:00
Pavel Begunkov d80c0f00d0 io_uring: add io_req_local_work_add wake fast path
Don't wake the master task after queueing a deferred tw unless it's
currently waiting in io_cqring_wait.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/717702d772825a6647e6c315b4690277ba84c3fc.1673274244.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:40 -07:00
Pavel Begunkov 130bd686d9 io_uring: waitqueue-less cq waiting
With DEFER_TASKRUN only ctx->submitter_task might be waiting for CQEs,
we can use this to optimise io_cqring_wait(). Replace ->cq_wait
waitqueue with waking the task directly.

It works but misses an important optimisation covered by the following
patch, so this patch without follow ups might hurt performance.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/103d174d35d919d4cb0922d8a9c93a8f0c35f74a.1673274244.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:40 -07:00
Pavel Begunkov 3181e22fb7 io_uring: wake up optimisations
Flush completions is done either from the submit syscall or by the
task_work, both are in the context of the submitter task, and when it
goes for a single threaded rings like implied by ->task_complete, there
won't be any waiters on ->cq_wait but the master task. That means that
there can be no tasks sleeping on cq_wait while we run
__io_submit_flush_completions() and so waking up can be skipped.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/60ad9768ec74435a0ddaa6eec0ffa7729474f69f.1673274244.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:40 -07:00
Pavel Begunkov bca39f3905 io_uring: add lazy poll_wq activation
Even though io_poll_wq_wake()'s waitqueue_active reuses a barrier we do
for another waitqueue, it's not going to be the case in the future and
so we want to have a fast path for it when the ring has never been
polled.

Move poll_wq wake ups into __io_commit_cqring_flush() using a new flag
called ->poll_activated. The idea behind the flag is to set it when the
ring was polled for the first time. This requires additional sync to not
miss events, which is done here by using task_work for ->task_complete
rings, and by default enabling the flag for all other types of rings.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/060785e8e9137a920b232c0c7f575b131af19cac.1673274244.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:40 -07:00
Pavel Begunkov 7b235dd82a io_uring: separate wq for ring polling
Don't use ->cq_wait for ring polling but add a separate wait queue for
it. We need it for following patches.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/dea0be0bf990503443c5c6c337fc66824af7d590.1673274244.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:40 -07:00
Pavel Begunkov 360173ab9e io_uring: move io_run_local_work_locked
io_run_local_work_locked() is only used in io_uring.c, move it there.
With that we can also make __io_run_local_work() static.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/91757bcb33e5774e49fed6f2b6e058630608119b.1673274244.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:40 -07:00
Pavel Begunkov 3e5655552a io_uring: mark io_run_local_work static
io_run_local_work is enclosed in io_uring.c, we don't need to export it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b477fb81f5e77044f724a06fe245d5c078659364.1673274244.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:40 -07:00
Pavel Begunkov 2f413956cc io_uring: don't set TASK_RUNNING in local tw runner
The CQ waiting loop sets TASK_RUNNING before trying to execute
task_work, no need to repeat it in io_run_local_work().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9d9422c429ef3f9457b4f4b8288bf4789564f33b.1673274244.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:40 -07:00
Pavel Begunkov bd550173ac io_uring: refactor io_wake_function
Remove a local variable ctx in io_wake_function(), we don't need it if
io_should_wake() triggers it to wake up.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e60eb1008aebe286aab7d34c772ed01c447bddb1.1673274244.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:40 -07:00
Dmitrii Bundin 81594e7e7a io_uring: remove excessive unlikely on IS_ERR
The IS_ERR function uses the IS_ERR_VALUE macro under the hood which
already wraps the condition into unlikely.

Signed-off-by: Dmitrii Bundin <dmitrii.bundin.a@gmail.com>
Link: https://lore.kernel.org/r/20230109185854.25698-1-dmitrii.bundin.a@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:40 -07:00
Breno Leitao cbeb47a7b5 io_uring/msg_ring: Pass custom flags to the cqe
This patch adds a new flag (IORING_MSG_RING_FLAGS_PASS) in the message
ring operations (IORING_OP_MSG_RING). This new flag enables the sender
to specify custom flags, which will be copied over to cqe->flags in the
receiving ring.  These custom flags should be specified using the
sqe->file_index field.

This mechanism provides additional flexibility when sending messages
between rings.

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20230103160507.617416-1-leitao@debian.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:40 -07:00
Pavel Begunkov d33a39e577 io_uring: keep timeout in io_wait_queue
Move waiting timeout into io_wait_queue

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e4b48a9e26a3b1cf97c80121e62d4b5ab873d28d.1672916894.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:40 -07:00
Pavel Begunkov 46ae7eef44 io_uring: optimise non-timeout waiting
Unlike the jiffy scheduling version, schedule_hrtimeout() jumps a few
functions before getting into schedule() even if there is no actual
timeout needed. Some tests showed that it takes up to 1% of CPU cycles.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/89f880574eceee6f4899783377ead234df7b3d04.1672916894.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:40 -07:00
Pavel Begunkov 326a9e482e io_uring: set TASK_RUNNING right after schedule
Instead of constantly watching that the state of the task is running
before executing tw or taking locks in io_cqring_wait(), switch it back
to TASK_RUNNING immediately.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/246dddee247d89fd52023f785ed17cc34962a008.1672916894.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:40 -07:00
Pavel Begunkov 490c00eb4f io_uring: simplify io_has_work
->work_llist should never be non-empty for a non DEFER_TASKRUN ring, so
we can safely skip checking the flag.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/26af9f73c09a56c9a035f94db56127358688f3aa.1672916894.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:40 -07:00
Pavel Begunkov 846072f16e io_uring: mimimise io_cqring_wait_schedule
io_cqring_wait_schedule() is called after we started waiting on the cq
wq and set the state to TASK_INTERRUPTIBLE, for that reason we have to
constantly worry whether we has returned the state back to running or
not. Leave only quick checks in io_cqring_wait_schedule() and move the
rest including running task work to the callers. Note, we run tw in the
loop after the sched checks because of the fast path in the beginning of
the function.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2814fabe75e2e019e7ca43ea07daa94564349805.1672916894.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:40 -07:00
Pavel Begunkov 3fcf19d592 io_uring: parse check_cq out of wq waiting
We already avoid flushing overflows in io_cqring_wait_schedule() but
only return an error for the outer loop to handle it. Minimise it even
further by moving all ->check_cq parsing there.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9dfcec3121013f98208dbf79368d636d74e1231a.1672916894.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:39 -07:00
Pavel Begunkov 140102ae9a io_uring: move defer tw task checks
Most places that want to run local tw explicitly and in advance check if
they are allowed to do so. Don't rely on a similar check in
__io_run_local_work(), leave it as a just-in-case warning and make sure
callers checks capabilities themselves.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/990fe0e8e70fd4d57e43625e5ce8fba584821d1a.1672916894.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:39 -07:00
Pavel Begunkov 1414d62985 io_uring: kill io_run_task_work_ctx
There is only one user of io_run_task_work_ctx(), inline it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/40953c65f7c88fb00cdc4d870ca5d5319fb3d7ea.1672916894.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:39 -07:00
Pavel Begunkov f36ba6cf1a io_uring: don't iterate cq wait fast path
Task work runners keep running until all queues tw items are exhausted.
It's also rare for defer tw to queue normal tw and vise versa. Taking it
into account, there is only a dim chance that further iterating the
io_cqring_wait() fast path will get us anything and so we can remove
the loop there.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1f9565726661266abaa5d921e97433c831759ecf.1672916894.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:39 -07:00
Pavel Begunkov 0c4fe008c9 io_uring: rearrange defer list checks
There should be nothing in the ->work_llist for non DEFER_TASKRUN rings,
so we can skip flag checks and test the list emptiness directly. Also
move it out of io_run_local_work() for inlining.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/331d63fd15ca79b35b95c82a82d9246110686392.1672916894.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29 15:17:39 -07:00
Dylan Yudaken ef5c600adb io_uring: always prep_async for drain requests
Drain requests all go through io_drain_req, which has a quick exit in case
there is nothing pending (ie the drain is not useful). In that case it can
run the issue the request immediately.

However for safety it queues it through task work.
The problem is that in this case the request is run asynchronously, but
the async work has not been prepared through io_req_prep_async.

This has not been a problem up to now, as the task work always would run
before returning to userspace, and so the user would not have a chance to
race with it.

However - with IORING_SETUP_DEFER_TASKRUN - this is no longer the case and
the work might be defered, giving userspace a chance to change data being
referred to in the request.

Instead _always_ prep_async for drain requests, which is simpler anyway
and removes this issue.

Cc: stable@vger.kernel.org
Fixes: c0e0d6ba25 ("io_uring: add IORING_SETUP_DEFER_TASKRUN")
Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20230127105911.2420061-1-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-27 06:29:29 -07:00
Jens Axboe b00c51ef8f io_uring/net: cache provided buffer group value for multishot receives
If we're using ring provided buffers with multishot receive, and we end
up doing an io-wq based issue at some points that also needs to select
a buffer, we'll lose the initially assigned buffer group as
io_ring_buffer_select() correctly clears the buffer group list as the
issue isn't serialized by the ctx uring_lock. This is fine for normal
receives as the request puts the buffer and finishes, but for multishot,
we will re-arm and do further receives. On the next trigger for this
multishot receive, the receive will try and pick from a buffer group
whose value is the same as the buffer ID of the las receive. That is
obviously incorrect, and will result in a premature -ENOUFS error for
the receive even if we had available buffers in the correct group.

Cache the buffer group value at prep time, so we can restore it for
future receives. This only needs doing for the above mentioned case, but
just do it by default to keep it easier to read.

Cc: stable@vger.kernel.org
Fixes: b3fdea6ecb ("io_uring: multishot recv")
Fixes: 9bb66906f2 ("io_uring: support multishot in recvmsg")
Cc: Dylan Yudaken <dylany@meta.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-23 07:08:08 -07:00
Jens Axboe 8caa03f10b io_uring/poll: don't reissue in case of poll race on multishot request
A previous commit fixed a poll race that can occur, but it's only
applicable for multishot requests. For a multishot request, we can safely
ignore a spurious wakeup, as we never leave the waitqueue to begin with.

A blunt reissue of a multishot armed request can cause us to leak a
buffer, if they are ring provided. While this seems like a bug in itself,
it's not really defined behavior to reissue a multishot request directly.
It's less efficient to do so as well, and not required to rearm anything
like it is for singleshot poll requests.

Cc: stable@vger.kernel.org
Fixes: 6e5aedb932 ("io_uring/poll: attempt request issue after racy poll wakeup")
Reported-and-tested-by: Olivier Langlois <olivier@trillion01.com>
Link: https://github.com/axboe/liburing/issues/778
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-20 15:11:54 -07:00
Pavel Begunkov 8579538c89 io_uring/msg_ring: fix remote queue to disabled ring
IORING_SETUP_R_DISABLED rings don't have the submitter task set, so
it's not always safe to use ->submitter_task. Disallow posting msg_ring
messaged to disabled rings. Also add task NULL check for loosy sync
around testing for IORING_SETUP_R_DISABLED.

Cc: stable@vger.kernel.org
Fixes: 6d043ee116 ("io_uring: do msg_ring in target task via tw")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-20 09:49:34 -07:00
Pavel Begunkov 56d8e3180c io_uring/msg_ring: fix flagging remote execution
There is a couple of problems with queueing a tw in io_msg_ring_data()
for remote execution. First, once we queue it the target ring can
go away and so setting IORING_SQ_TASKRUN there is not safe. Secondly,
the userspace might not expect IORING_SQ_TASKRUN.

Extract a helper and uniformly use TWA_SIGNAL without TWA_SIGNAL_NO_IPI
tricks for now, just as it was done in the original patch.

Cc: stable@vger.kernel.org
Fixes: 6d043ee116 ("io_uring: do msg_ring in target task via tw")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-20 09:49:29 -07:00
Jens Axboe e12d7a46f6 io_uring/msg_ring: fix missing lock on overflow for IOPOLL
If the target ring is configured with IOPOLL, then we always need to hold
the target ring uring_lock before posting CQEs. We could just grab it
unconditionally, but since we don't expect many target rings to be of this
type, make grabbing the uring_lock conditional on the ring type.

Link: https://lore.kernel.org/io-uring/Y8krlYa52%2F0YGqkg@ip-172-31-85-199.ec2.internal/
Reported-by: Xingyuan Mo <hdthky0@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-19 10:43:59 -07:00
Jens Axboe 423d5081d0 io_uring/msg_ring: move double lock/unlock helpers higher up
In preparation for needing them somewhere else, move them and get rid of
the unused 'issue_flags' for the unlock side.

No functional changes in this patch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-19 09:01:27 -07:00
David Hildenbrand fc4f4be9b5 mm/nommu: factor out check for NOMMU shared mappings into is_nommu_shared_mapping()
Patch series "mm/nommu: don't use VM_MAYSHARE for MAP_PRIVATE mappings".

Trying to reduce the confusion around VM_SHARED and VM_MAYSHARE first
requires !CONFIG_MMU to stop using VM_MAYSHARE for MAP_PRIVATE mappings. 
CONFIG_MMU only sets VM_MAYSHARE for MAP_SHARED mappings.

This paves the way for further VM_MAYSHARE and VM_SHARED cleanups: for
example, renaming VM_MAYSHARED to VM_MAP_SHARED to make it cleaner what is
actually means.

Let's first get the weird case out of the way and not use VM_MAYSHARE in
MAP_PRIVATE mappings, using a new VM_MAYOVERLAY flag instead.


This patch (of 3):

We want to stop using VM_MAYSHARE in private mappings to pave the way for
clarifying the semantics of VM_MAYSHARE vs.  VM_SHARED and reduce the
confusion.  While CONFIG_MMU uses VM_MAYSHARE to represent MAP_SHARED,
!CONFIG_MMU also sets VM_MAYSHARE for selected R/O private file mappings
that are an effective overlay of a file mapping.

Let's factor out all relevant VM_MAYSHARE checks in !CONFIG_MMU code into
is_nommu_shared_mapping() first.

Note that whenever VM_SHARED is set, VM_MAYSHARE must be set as well
(unless there is a serious BUG).  So there is not need to test for
VM_SHARED manually.

No functional change intended.

Link: https://lkml.kernel.org/r/20230102160856.500584-1-david@redhat.com
Link: https://lkml.kernel.org/r/20230102160856.500584-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Nicolas Pitre <nico@fluxnic.net>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-18 17:12:56 -08:00
Pavel Begunkov 544d163d65 io_uring: lock overflowing for IOPOLL
syzbot reports an issue with overflow filling for IOPOLL:

WARNING: CPU: 0 PID: 28 at io_uring/io_uring.c:734 io_cqring_event_overflow+0x1c0/0x230 io_uring/io_uring.c:734
CPU: 0 PID: 28 Comm: kworker/u4:1 Not tainted 6.2.0-rc3-syzkaller-16369-g358a161a6a9e #0
Workqueue: events_unbound io_ring_exit_work
Call trace:
 io_cqring_event_overflow+0x1c0/0x230 io_uring/io_uring.c:734
 io_req_cqe_overflow+0x5c/0x70 io_uring/io_uring.c:773
 io_fill_cqe_req io_uring/io_uring.h:168 [inline]
 io_do_iopoll+0x474/0x62c io_uring/rw.c:1065
 io_iopoll_try_reap_events+0x6c/0x108 io_uring/io_uring.c:1513
 io_uring_try_cancel_requests+0x13c/0x258 io_uring/io_uring.c:3056
 io_ring_exit_work+0xec/0x390 io_uring/io_uring.c:2869
 process_one_work+0x2d8/0x504 kernel/workqueue.c:2289
 worker_thread+0x340/0x610 kernel/workqueue.c:2436
 kthread+0x12c/0x158 kernel/kthread.c:376
 ret_from_fork+0x10/0x20 arch/arm64/kernel/entry.S:863

There is no real problem for normal IOPOLL as flush is also called with
uring_lock taken, but it's getting more complicated for IOPOLL|SQPOLL,
for which __io_cqring_overflow_flush() happens from the CQ waiting path.

Reported-and-tested-by: syzbot+6805087452d72929404e@syzkaller.appspotmail.com
Cc: stable@vger.kernel.org # 5.10+
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-13 07:32:46 -07:00
Jens Axboe 6e5aedb932 io_uring/poll: attempt request issue after racy poll wakeup
If we have multiple requests waiting on the same target poll waitqueue,
then it's quite possible to get a request triggered and get disappointed
in not being able to make any progress with it. If we race in doing so,
we'll potentially leave the poll request on the internal tables, but
removed from the waitqueue. That means that any subsequent trigger of
the poll waitqueue will not kick that request into action, causing an
application to potentially wait for completion of a request that will
never happen.

Fix this by adding a new poll return state, IOU_POLL_REISSUE. Rather
than have complicated logic for how to re-arm a given type of request,
just punt it for a reissue.

While in there, move the 'ret' variable to the only section where it
gets used. This avoids confusion the scope of it.

Cc: stable@vger.kernel.org
Fixes: eb0089d629 ("io_uring: single shot poll removal optimisation")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-12 10:35:51 -07:00
Jens Axboe ea97cbebaf io_uring/fdinfo: include locked hash table in fdinfo output
A previous commit split the hash table for polled requests into two
parts, but didn't get the fdinfo output updated. This means that it's
less useful for debugging, as we may think a given request is not pending
poll.

Fix this up by dumping the locked hash table contents too.

Fixes: 9ca9fb24d5 ("io_uring: mutex locked poll hashing")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-10 10:24:52 -07:00
Jens Axboe febb985c06 io_uring/poll: add hash if ready poll request can't complete inline
If we don't, then we may lose access to it completely, leading to a
request leak. This will eventually stall the ring exit process as
well.

Cc: stable@vger.kernel.org
Fixes: 49f1c68e04 ("io_uring: optimise submission side poll_refs")
Reported-and-tested-by: syzbot+6c95df01470a47fc3af4@syzkaller.appspotmail.com
Link: https://lore.kernel.org/io-uring/0000000000009f829805f1ce87b2@google.com/
Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-09 15:46:57 -07:00
Jens Axboe 1e23db450c io_uring: use iter_ubuf for single range imports
This is more efficient than iter_iov.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
[merge to 6.2, minor fixes]
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-01-08 20:59:17 -07:00
Jens Axboe 4b61152e10 io_uring: switch network send/recv to ITER_UBUF
This is more efficient than iter_iov.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
[merged to 6.2]
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-01-08 20:59:17 -07:00
Jens Axboe e6db6f9398 io_uring/io-wq: only free worker if it was allocated for creation
We have two types of task_work based creation, one is using an existing
worker to setup a new one (eg when going to sleep and we have no free
workers), and the other is allocating a new worker. Only the latter
should be freed when we cancel task_work creation for a new worker.

Fixes: af82425c6a ("io_uring/io-wq: free worker if task_work creation is canceled")
Reported-by: syzbot+d56ec896af3637bdb7e4@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-08 10:39:17 -07:00
Pavel Begunkov 12521a5d5c io_uring: fix CQ waiting timeout handling
Jiffy to ktime CQ waiting conversion broke how we treat timeouts, in
particular we rearm it anew every time we get into
io_cqring_wait_schedule() without adjusting the timeout. Waiting for 2
CQEs and getting a task_work in the middle may double the timeout value,
or even worse in some cases task may wait indefinitely.

Cc: stable@vger.kernel.org
Fixes: 228339662b ("io_uring: don't convert to jiffies for waiting on timeouts")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f7bffddd71b08f28a877d44d37ac953ddb01590d.1672915663.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-05 08:04:47 -07:00
Pavel Begunkov f26cc95935 io_uring: lockdep annotate CQ locking
Locking around CQE posting is complex and depends on options the ring is
created with, add more thorough lockdep annotations checking all
invariants.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/aa3770b4eacae3915d782cc2ab2f395a99b4b232.1672795976.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-03 19:05:41 -07:00
Pavel Begunkov 9ffa13ff78 io_uring: pin context while queueing deferred tw
Unlike normal tw, nothing prevents deferred tw to be executed right
after an tw item added to ->work_llist in io_req_local_work_add(). For
instance, the waiting task may get waken up by CQ posting or a normal
tw. Thus we need to pin the ring for the rest of io_req_local_work_add()

Cc: stable@vger.kernel.org
Fixes: c0e0d6ba25 ("io_uring: add IORING_SETUP_DEFER_TASKRUN")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1a79362b9c10b8523ef70b061d96523650a23344.1672795998.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-03 19:03:28 -07:00
Jens Axboe af82425c6a io_uring/io-wq: free worker if task_work creation is canceled
If we cancel the task_work, the worker will never come into existance.
As this is the last reference to it, ensure that we get it freed
appropriately.

Cc: stable@vger.kernel.org
Reported-by: 진호 <wnwlsgh98@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-02 16:49:46 -07:00
Jens Axboe 343190841a io_uring: check for valid register opcode earlier
We only check the register opcode value inside the restricted ring
section, move it into the main io_uring_register() function instead
and check it up front.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-23 06:40:32 -07:00
Jens Axboe 23fffb2f09 io_uring/cancel: re-grab ctx mutex after finishing wait
If we have a signal pending during cancelations, it'll cause the
task_work run to return an error. Since we didn't run task_work, the
current task is left in TASK_INTERRUPTIBLE state when we need to
re-grab the ctx mutex, and the kernel will rightfully complain about
that.

Move the lock grabbing for the error cases outside the loop to avoid
that issue.

Reported-by: syzbot+7df055631cd1be4586fd@syzkaller.appspotmail.com
Link: https://lore.kernel.org/io-uring/0000000000003a14a905f05050b0@google.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-21 13:31:40 -07:00
Jens Axboe 52ea806ad9 io_uring: finish waiting before flushing overflow entries
If we have overflow entries being generated after we've done the
initial flush in io_cqring_wait(), then we could be flushing them in the
main wait loop as well. If that's done after having added ourselves
to the cq_wait waitqueue, then the task state can be != TASK_RUNNING
when we enter the overflow flush.

Check for the need to overflow flush, and finish our wait cycle first
if we have to do so.

Reported-and-tested-by: syzbot+cf6ea1d6bb30a4ce10b2@syzkaller.appspotmail.com
Link: https://lore.kernel.org/io-uring/000000000000cb143a05f04eee15@google.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-21 08:43:53 -07:00
Pavel Begunkov 6c3e8955d4 io_uring/net: fix cleanup after recycle
Don't access io_async_msghdr io_netmsg_recycle(), it may be reallocated.

Cc: stable@vger.kernel.org
Fixes: 9bb66906f2 ("io_uring: support multishot in recvmsg")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9e326f4ad4046ddadf15bf34bf3fa58c6372f6b5.1671461985.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-19 08:28:28 -07:00
Jens Axboe 990a4de57e io_uring/net: ensure compat import handlers clear free_iov
If we're not allocating the vectors because the count is below
UIO_FASTIOV, we still do need to properly clear ->free_iov to prevent
an erronous free of on-stack data.

Reported-by: Jiri Slaby <jirislaby@gmail.com>
Fixes: 4c17a496a7 ("io_uring/net: fix cleanup double free free_iov init")
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-19 07:35:16 -07:00
Jens Axboe 35d90f95cf io_uring: include task_work run after scheduling in wait for events
It's quite possible that we got woken up because task_work was queued,
and we need to process this task_work to generate the events waited for.
If we return to the wait loop without running task_work, we'll end up
adding the task to the waitqueue again, only to call
io_cqring_wait_schedule() again which will run the task_work. This is
less efficient than it could be, as it requires adding to the cq_wait
queue again. It also triggers the wakeup path for completions as
cq_wait is now non-empty with the task itself, and it'll require another
lock grab and deletion to remove ourselves from the waitqueue.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-17 20:35:54 -07:00
Jens Axboe 6434ec0186 io_uring: don't use TIF_NOTIFY_SIGNAL to test for availability of task_work
Use task_work_pending() as a better test for whether we have task_work
or not, TIF_NOTIFY_SIGNAL is only valid if the any of the task_work
items had been queued with TWA_SIGNAL as the notification mechanism.
Hence task_work_pending() is a more reliable check.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-17 13:40:17 -07:00
Dylan Yudaken 44a84da452 io_uring: use call_rcu_hurry if signaling an eventfd
io_uring uses call_rcu in the case it needs to signal an eventfd as a
result of an eventfd signal, since recursing eventfd signals are not
allowed. This should be calling the new call_rcu_hurry API to not delay
the signal.

Signed-off-by: Dylan Yudaken <dylany@meta.com>

Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Link: https://lore.kernel.org/r/20221215184138.795576-1-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-15 11:59:29 -07:00
Pavel Begunkov a8cf95f936 io_uring: fix overflow handling regression
Because the single task locking series got reordered ahead of the
timeout and completion lock changes, two hunks inadvertently ended up
using __io_fill_cqe_req() rather than io_fill_cqe_req(). This meant
that we dropped overflow handling in those two spots. Reinstate the
correct CQE filling helper.

Fixes: f66f73421f ("io_uring: skip spinlocking for ->task_complete")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-15 08:20:10 -07:00
Pavel Begunkov e5f30f6fb2 io_uring: ease timeout flush locking requirements
We don't need completion_lock for timeout flushing, don't take it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1e3dc657975ac445b80e7bdc40050db783a5935a.1670002973.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-14 08:53:35 -07:00
Pavel Begunkov 6971253f07 io_uring: revise completion_lock locking
io_kill_timeouts() doesn't post any events but queues everything to
task_work. Locking there is needed for protecting linked requests
traversing, we should grab completion_lock directly instead of using
io_cq_[un]lock helpers. Same goes for __io_req_find_next_prep().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/88e75d481a65dc295cb59722bb1cf76402d1c06b.1670002973.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-14 08:53:04 -07:00
Pavel Begunkov ea011ee102 io_uring: protect cq_timeouts with timeout_lock
Read cq_timeouts in io_flush_timeouts() only after taking the
timeout_lock, as it's protected by it. There are many places where we
also grab ->completion_lock, but for instance io_timeout_fn() doesn't
and still modifies cq_timeouts.

Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9c79544dd6cf5c4018cb1bab99cf481a93ea46ef.1670002973.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-14 08:51:28 -07:00
Linus Torvalds ce8a79d560 for-6.2/block-2022-12-08
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmOScsgQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpi5ID/9pLXFYOq1+uDjU0KO/MdjMjK8Ukr34lCnk
 WkajRLheE8JBKOFDE54XJk56sQSZHX9bTWqziar0h1fioh7FlQR/tVvzsERCm2M9
 2y9THJNJygC68wgybStyiKlshFjl7TD7Kv5N9Y3xP3mkQygT+D6o8fXZk5xQbYyH
 YdFSoq4rJVHxRL03yzQiReGGIYdOUEQQh8l1FiLwLlKa3lXAey1KuxWIzksVN0KK
 aZB4QhiBpOiPgDHUVisq2XtyQjpZ2byoCImPzgrcqk9Jo4esvm/e6esrg4xlsvII
 LKFFkTmbVqjUZtFjqakFHmfuzVor4nU5f+xb90ZHExuuODYckkxWp5rWhf9QwqqI
 0ik6WYgI1/5vnHnX8f2DYzOFQf9qa/rLgg0CshyUODlD6RfHa9vntqYvlIFkmOBd
 Q7KblIoK8YTzUS1M+v7X8JQ7gDR2KwygH37Da2KJS+vgvfIb8kJGr1ZORuhJuJJ7
 Bl69gaNkHTHrqufp7UI64YXfueeuNu2J9z3zwzGoxeaFaofF/phDn0/2gCQE1fQI
 XBhsMw+ETqI6B2SPHMnzYDu2DM1S8ZTOYQlaD4G3uqgWnAM1tG707395uAy5yu4n
 D5azU1fVG4UocoNIyPujpaoSRs2zWZycEFEeUQkhyDDww/j4hlHi6H33eOnk0zsr
 wxzFGfvHfw==
 =k/vv
 -----END PGP SIGNATURE-----

Merge tag 'for-6.2/block-2022-12-08' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

 - NVMe pull requests via Christoph:
      - Support some passthrough commands without CAP_SYS_ADMIN (Kanchan
        Joshi)
      - Refactor PCIe probing and reset (Christoph Hellwig)
      - Various fabrics authentication fixes and improvements (Sagi
        Grimberg)
      - Avoid fallback to sequential scan due to transient issues (Uday
        Shankar)
      - Implement support for the DEAC bit in Write Zeroes (Christoph
        Hellwig)
      - Allow overriding the IEEE OUI and firmware revision in configfs
        for nvmet (Aleksandr Miloserdov)
      - Force reconnect when number of queue changes in nvmet (Daniel
        Wagner)
      - Minor fixes and improvements (Uros Bizjak, Joel Granados, Sagi
        Grimberg, Christoph Hellwig, Christophe JAILLET)
      - Fix and cleanup nvme-fc req allocation (Chaitanya Kulkarni)
      - Use the common tagset helpers in nvme-pci driver (Christoph
        Hellwig)
      - Cleanup the nvme-pci removal path (Christoph Hellwig)
      - Use kstrtobool() instead of strtobool (Christophe JAILLET)
      - Allow unprivileged passthrough of Identify Controller (Joel
        Granados)
      - Support io stats on the mpath device (Sagi Grimberg)
      - Minor nvmet cleanup (Sagi Grimberg)

 - MD pull requests via Song:
      - Code cleanups (Christoph)
      - Various fixes

 - Floppy pull request from Denis:
      - Fix a memory leak in the init error path (Yuan)

 - Series fixing some batch wakeup issues with sbitmap (Gabriel)

 - Removal of the pktcdvd driver that was deprecated more than 5 years
   ago, and subsequent removal of the devnode callback in struct
   block_device_operations as no users are now left (Greg)

 - Fix for partition read on an exclusively opened bdev (Jan)

 - Series of elevator API cleanups (Jinlong, Christoph)

 - Series of fixes and cleanups for blk-iocost (Kemeng)

 - Series of fixes and cleanups for blk-throttle (Kemeng)

 - Series adding concurrent support for sync queues in BFQ (Yu)

 - Series bringing drbd a bit closer to the out-of-tree maintained
   version (Christian, Joel, Lars, Philipp)

 - Misc drbd fixes (Wang)

 - blk-wbt fixes and tweaks for enable/disable (Yu)

 - Fixes for mq-deadline for zoned devices (Damien)

 - Add support for read-only and offline zones for null_blk
   (Shin'ichiro)

 - Series fixing the delayed holder tracking, as used by DM (Yu,
   Christoph)

 - Series enabling bio alloc caching for IRQ based IO (Pavel)

 - Series enabling userspace peer-to-peer DMA (Logan)

 - BFQ waker fixes (Khazhismel)

 - Series fixing elevator refcount issues (Christoph, Jinlong)

 - Series cleaning up references around queue destruction (Christoph)

 - Series doing quiesce by tagset, enabling cleanups in drivers
   (Christoph, Chao)

 - Series untangling the queue kobject and queue references (Christoph)

 - Misc fixes and cleanups (Bart, David, Dawei, Jinlong, Kemeng, Ye,
   Yang, Waiman, Shin'ichiro, Randy, Pankaj, Christoph)

* tag 'for-6.2/block-2022-12-08' of git://git.kernel.dk/linux: (247 commits)
  blktrace: Fix output non-blktrace event when blk_classic option enabled
  block: sed-opal: Don't include <linux/kernel.h>
  sed-opal: allow using IOC_OPAL_SAVE for locking too
  blk-cgroup: Fix typo in comment
  block: remove bio_set_op_attrs
  nvmet: don't open-code NVME_NS_ATTR_RO enumeration
  nvme-pci: use the tagset alloc/free helpers
  nvme: add the Apple shared tag workaround to nvme_alloc_io_tag_set
  nvme: only set reserved_tags in nvme_alloc_io_tag_set for fabrics controllers
  nvme: consolidate setting the tagset flags
  nvme: pass nr_maps explicitly to nvme_alloc_io_tag_set
  block: bio_copy_data_iter
  nvme-pci: split out a nvme_pci_ctrl_is_dead helper
  nvme-pci: return early on ctrl state mismatch in nvme_reset_work
  nvme-pci: rename nvme_disable_io_queues
  nvme-pci: cleanup nvme_suspend_queue
  nvme-pci: remove nvme_pci_disable
  nvme-pci: remove nvme_disable_admin_queue
  nvme: merge nvme_shutdown_ctrl into nvme_disable_ctrl
  nvme: use nvme_wait_ready in nvme_shutdown_ctrl
  ...
2022-12-13 10:43:59 -08:00
Linus Torvalds 96f7e448b9 for-6.2/io_uring-next-2022-12-08
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmOSZJAQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpp+OD/9M04tGsVFCdqtKty5lBlDs03OnA03f404c
 Ottp2pop2JrBx7+ycS/dl78MQjmghh8Ectel8kOiswDeRQc98TtzWY31DF1d56yS
 aGYCq2Ww2gx5ziYmJgiFU7RRLTFlpfa/vZUBMK4HW4MYm2ihxtfNc72Oa8H9KGDJ
 /RYk5+PSCau+UFwyWu91rORVNRXjLr1mFmgzRTmFhL2unYYuOO83mK4GpK2f8rHx
 qpT7Wn9IS9xiTpr8rHqs8y6rxV6+Tnv/HqR8kKoviHvQU/u6fzvKSNEu1NvK/Znm
 V0z8cI4JJZUelDyqJe5ITq8cIS59amzILEIneclYQd20NkqcFYPlS56K6qR9qL7J
 6eNHvgH7iKvnk9JlR2soKojC6KWEPtVni2BjPEXgXHrfUWdINMKrT6MwTNOjztWj
 h+EaqLBGQb5m/nRCCMeE9kfUK23Rg6Ev+H+aas0SgqD5Isg/hVG+aMtjWLmWquCU
 pKg2UqxqsR9ymKj92KJSoN7F8Z2U0JsoHBKzAammlnmfhxl294RjGqBMJjjI5eS9
 Zu+fTte4EuFY5s/TvE5FCBmQ0Oeg+ud2f+GKXDzF25equtct8QCfHIbguM6Yr3X2
 3ANYGtLwP7Cj96U1Y++RWgTpBTTKwGkWyzEyS9SN/+MRXhIjeP8JZeGdXBDWaXLC
 Vp7j39Avgg==
 =MzhS
 -----END PGP SIGNATURE-----

Merge tag 'for-6.2/io_uring-next-2022-12-08' of git://git.kernel.dk/linux

Pull io_uring updates part two from Jens Axboe:

 - Misc fixes (me, Lin)

 - Series from Pavel extending the single task exclusive ring mode,
   yielding nice improvements for the common case of having a single
   ring per thread (Pavel)

 - Cleanup for MSG_RING, removing our IOPOLL hack (Pavel)

 - Further poll cleanups and fixes (Pavel)

 - Misc cleanups and fixes (Pavel)

* tag 'for-6.2/io_uring-next-2022-12-08' of git://git.kernel.dk/linux: (22 commits)
  io_uring/msg_ring: flag target ring as having task_work, if needed
  io_uring: skip spinlocking for ->task_complete
  io_uring: do msg_ring in target task via tw
  io_uring: extract a io_msg_install_complete helper
  io_uring: get rid of double locking
  io_uring: never run tw and fallback in parallel
  io_uring: use tw for putting rsrc
  io_uring: force multishot CQEs into task context
  io_uring: complete all requests in task context
  io_uring: don't check overflow flush failures
  io_uring: skip overflow CQE posting for dying ring
  io_uring: improve io_double_lock_ctx fail handling
  io_uring: dont remove file from msg_ring reqs
  io_uring: reshuffle issue_flags
  io_uring: don't reinstall quiesce node for each tw
  io_uring: improve rsrc quiesce refs checks
  io_uring: don't raw spin unlock to match cq_lock
  io_uring: combine poll tw handlers
  io_uring: improve poll warning handling
  io_uring: remove ctx variable in io_poll_check_events
  ...
2022-12-13 10:40:31 -08:00
Linus Torvalds 54e60e505d for-6.2/io_uring-2022-12-08
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmOSYp4QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpgWNEADTcHejQ5y8Ctc1BEsCiEnCbR5DeIwESBKN
 SfK7uK5N5GjKZgKeMtkdxNeckH2wSZ79VFpz2/b4fszM11H+P81r/PP+OwQojcra
 1YyPbp2jXlQZGEYijiXYDohnKr8NnJYj/nuwm4T73OPOD48ekbrY36/t3Hd75jKk
 M5L/HOpbKhA+fypPHYlm3XlwEM9/4wupsDeiabTuAeFpLh66V/h85ZLY91c3Bf7j
 yllzT2CCGQsh+fnqdovpsqUxM4Sh6sHfa2QhwkfJFfU9OLwDz0TLlcvSVwWWICD1
 DGeDE/MPBKZ4Z5zp8t94vTnZDAiExwkZbmCOnOkdYoPdMDqCDhp8a9WywV3IiJJi
 +hRN6Au6BWw8Rj3b3pbobfdXFJAvoHqXHM1SDAsWmEIMpuMhNalIpYqtna9JwRDe
 KV2ERMpGOUpmvQiRaJa5Wtz6hlFetmuav6Ij9DWb5f8+NXikiFAXk4g6TtD1Z7Cb
 15klwM8qW50zX7YUlzBLcjNdj7+HuIswLx0obZ3Uv1ogDwMQKXEvPaAHUwVN02q6
 cWP0P2Bc0EKvdwTSNpgNflhNsiV2DFJZpZfxkdPauN4t2EkJ38iEpHewffy0ecM7
 uyaXGQVQW7WFDuGno4cGbThQIG95MqkRPBhEyB4cOmcyS/aZ92ZFtV1iI8dVDA+v
 uuEIMc3OCA==
 =EgDc
 -----END PGP SIGNATURE-----

Merge tag 'for-6.2/io_uring-2022-12-08' of git://git.kernel.dk/linux

Pull io_uring updates from Jens Axboe:

 - Always ensure proper ordering in case of CQ ring overflow, which then
   means we can remove some work-arounds for that (Dylan)

 - Support completion batching for multishot, greatly increasing the
   efficiency for those (Dylan)

 - Flag epoll/eventfd wakeups done from io_uring, so that we can easily
   tell if we're recursing into io_uring again.

   Previously, this would have resulted in repeated multishot
   notifications if we had a dependency there. That could happen if an
   eventfd was registered as the ring eventfd, and we multishot polled
   for events on it. Or if an io_uring fd was added to epoll, and
   io_uring had a multishot request for the epoll fd.

   Test cases here:
	https://git.kernel.dk/cgit/liburing/commit/?id=919755a7d0096fda08fb6d65ac54ad8d0fe027cd

   Previously these got terminated when the CQ ring eventually
   overflowed, now it's handled gracefully (me).

 - Tightening of the IOPOLL based completions (Pavel)

 - Optimizations of the networking zero-copy paths (Pavel)

 - Various tweaks and fixes (Dylan, Pavel)

* tag 'for-6.2/io_uring-2022-12-08' of git://git.kernel.dk/linux: (41 commits)
  io_uring: keep unlock_post inlined in hot path
  io_uring: don't use complete_post in kbuf
  io_uring: spelling fix
  io_uring: remove io_req_complete_post_tw
  io_uring: allow multishot polled reqs to defer completion
  io_uring: remove overflow param from io_post_aux_cqe
  io_uring: add lockdep assertion in io_fill_cqe_aux
  io_uring: make io_fill_cqe_aux static
  io_uring: add io_aux_cqe which allows deferred completion
  io_uring: allow defer completion for aux posted cqes
  io_uring: defer all io_req_complete_failed
  io_uring: always lock in io_apoll_task_func
  io_uring: remove iopoll spinlock
  io_uring: iopoll protect complete_post
  io_uring: inline __io_req_complete_put()
  io_uring: remove io_req_tw_post_queue
  io_uring: use io_req_task_complete() in timeout
  io_uring: hold locks for io_req_complete_failed
  io_uring: add completion locking for iopoll
  io_uring: kill io_cqring_ev_posted() and __io_cq_unlock_post()
  ...
2022-12-13 10:33:08 -08:00
Linus Torvalds 9b93f5069f fs.idmapped.mnt_idmap.v6.2
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCY5by6AAKCRCRxhvAZXjc
 onblAPsFzodV8/9UoCIkKxwn0aiclbiAITTWI9ZLulmKhm0I6wD/RUOLKjt12uZJ
 m81UTfkWHopWKtQ+X3saZEcyYTNLugE=
 =AtGb
 -----END PGP SIGNATURE-----

Merge tag 'fs.idmapped.mnt_idmap.v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping

Pull idmapping updates from Christian Brauner:
 "Last cycle we've already made the interaction with idmapped mounts
  more robust and type safe by introducing the vfs{g,u}id_t type. This
  cycle we concluded the conversion and removed the legacy helpers.

  Currently we still pass around the plain namespace that was attached
  to a mount. This is in general pretty convenient but it makes it easy
  to conflate namespaces that are relevant on the filesystem - with
  namespaces that are relevent on the mount level. Especially for
  filesystem developers without detailed knowledge in this area this can
  be a potential source for bugs.

  Instead of passing the plain namespace we introduce a dedicated type
  struct mnt_idmap and replace the pointer with a pointer to a struct
  mnt_idmap. There are no semantic or size changes for the mount struct
  caused by this.

  We then start converting all places aware of idmapped mounts to rely
  on struct mnt_idmap. Once the conversion is done all helpers down to
  the really low-level make_vfs{g,u}id() and from_vfs{g,u}id() will take
  a struct mnt_idmap argument instead of two namespace arguments. This
  way it becomes impossible to conflate the two removing and thus
  eliminating the possibility of any bugs. Fwiw, I fixed some issues in
  that area a while ago in ntfs3 and ksmbd in the past. Afterwards only
  low-level code can ultimately use the associated namespace for any
  permission checks. Even most of the vfs can be completely obivious
  about this ultimately and filesystems will never interact with it in
  any form in the future.

  A struct mnt_idmap currently encompasses a simple refcount and pointer
  to the relevant namespace the mount is idmapped to. If a mount isn't
  idmapped then it will point to a static nop_mnt_idmap and if it
  doesn't that it is idmapped. As usual there are no allocations or
  anything happening for non-idmapped mounts. Everthing is carefully
  written to be a nop for non-idmapped mounts as has always been the
  case.

  If an idmapped mount is created a struct mnt_idmap is allocated and a
  reference taken on the relevant namespace. Each mount that gets
  idmapped or inherits the idmap simply bumps the reference count on
  struct mnt_idmap. Just a reminder that we only allow a mount to change
  it's idmapping a single time and only if it hasn't already been
  attached to the filesystems and has no active writers.

  The actual changes are fairly straightforward but this will have huge
  benefits for maintenance and security in the long run even if it
  causes some churn.

  Note that this also makes it possible to extend struct mount_idmap in
  the future. For example, it would be possible to place the namespace
  pointer in an anonymous union together with an idmapping struct. This
  would allow us to expose an api to userspace that would let it specify
  idmappings directly instead of having to go through the detour of
  setting up namespaces at all"

* tag 'fs.idmapped.mnt_idmap.v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping:
  acl: conver higher-level helpers to rely on mnt_idmap
  fs: introduce dedicated idmap type for mounts
2022-12-12 19:30:18 -08:00
Linus Torvalds 75f4d9af8b iov_iter work; most of that is about getting rid of
direction misannotations and (hopefully) preventing
 more of the same for the future.
 
 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
 -----BEGIN PGP SIGNATURE-----
 
 iHQEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCY5ZzQAAKCRBZ7Krx/gZQ
 65RZAP4nTkvOn0NZLVFkuGOx8pgJelXAvrteyAuecVL8V6CR4AD40qCVY51PJp8N
 MzwiRTeqnGDxTTF7mgd//IB6hoatAA==
 =bcvF
 -----END PGP SIGNATURE-----

Merge tag 'pull-iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull iov_iter updates from Al Viro:
 "iov_iter work; most of that is about getting rid of direction
  misannotations and (hopefully) preventing more of the same for the
  future"

* tag 'pull-iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  use less confusing names for iov_iter direction initializers
  iov_iter: saner checks for attempt to copy to/from iterator
  [xen] fix "direction" argument of iov_iter_kvec()
  [vhost] fix 'direction' argument of iov_iter_{init,bvec}()
  [target] fix iov_iter_bvec() "direction" argument
  [s390] memcpy_real(): WRITE is "data source", not destination...
  [s390] zcore: WRITE is "data source", not destination...
  [infiniband] READ is "data destination", not source...
  [fsi] WRITE is "data source", not destination...
  [s390] copy_oldmem_kernel() - WRITE is "data source", not destination
  csum_and_copy_to_iter(): handle ITER_DISCARD
  get rid of unlikely() on page_copy_sane() calls
2022-12-12 18:29:54 -08:00
Jens Axboe 761c61c159 io_uring/msg_ring: flag target ring as having task_work, if needed
Before the recent change, we didn't even wake the targeted task when
posting the cqe remotely. Now we do wake it, but we do want to ensure
that the recipient knows there's potential work there that needs to
get processed to get the CQE posted.

OR in IORING_SQ_TASKRUN for that purpose.

Link: https://lore.kernel.org/io-uring/2843c6b4-ba9a-b67d-e0f4-957f42098489@kernel.dk/
Fixes: 6d043ee116 ("io_uring: do msg_ring in target task via tw")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-08 09:36:02 -07:00
Pavel Begunkov f66f73421f io_uring: skip spinlocking for ->task_complete
->task_complete was added to serialised CQE posting by doing it from
the task context only (or fallback wq when the task is dead), and now we
can use that to avoid taking ->completion_lock while filling CQ entries.
The patch skips spinlocking only in two spots,
__io_submit_flush_completions() and flushing in io_aux_cqe, it's safer
and covers all cases we care about. Extra care is taken to force taking
the lock while queueing overflow entries.

It fundamentally relies on SINGLE_ISSUER to have only one task posting
events. It also need to take into account overflowed CQEs, flushing of
which happens in the cq wait path, and so this implementation also needs
DEFER_TASKRUN to limit waiters. For the same reason we disable it for
SQPOLL, and for IOPOLL as it won't benefit from it in any case.
DEFER_TASKRUN, SQPOLL and IOPOLL requirement may be relaxed in the
future.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2a8c91fd82cfcdcc1d2e5bac7051fe2c183bda73.1670384893.git.asml.silence@gmail.com
[axboe: modify to apply]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-07 08:51:08 -07:00
Pavel Begunkov 6d043ee116 io_uring: do msg_ring in target task via tw
While executing in a context of one io_uring instance msg_ring
manipulates another ring. We're trying to keep CQEs posting contained in
the context of the ring-owner task, use task_work to send the request to
the target ring's task when we're modifying its CQ or trying to install
a file. Note, we can't safely use io_uring task_work infra and have to
use task_work directly.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4d76c7b28ed5d71b520de4482fbb7f660f21cd80.1670384893.git.asml.silence@gmail.com
[axboe: use TWA_SIGNAL_NO_IPI]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-07 08:50:57 -07:00
Pavel Begunkov 1721131016 io_uring: extract a io_msg_install_complete helper
Extract a helper called io_msg_install_complete() from io_msg_send_fd(),
will be used later.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1500ca1054cc4286a3ee1c60aacead57fcdfa02a.1670384893.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-07 06:47:13 -07:00
Pavel Begunkov 11373026f2 io_uring: get rid of double locking
We don't need to take both uring_locks at once, msg_ring can be split in
two parts, first getting a file from the filetable of the first ring and
then installing it into the second one.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a80ecc2bc99c3b3f2cf20015d618b7c51419a797.1670384893.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-07 06:47:13 -07:00
Pavel Begunkov 77e443ab29 io_uring: never run tw and fallback in parallel
Once we fallback a tw we want all requests to that task to be given to
the fallback wq so we dont run it in parallel with the last, i.e. post
PF_EXITING, tw run of the task.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/96f4987265c4312f376f206511c6af3e77aaf5ac.1670384893.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-07 06:47:13 -07:00
Pavel Begunkov d34b1b0b67 io_uring: use tw for putting rsrc
Use task_work for completing rsrc removals, it'll be needed later for
spinlock optimisations.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/cbba5d53a11ee6fc2194dacea262c1d733c8b529.1670384893.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-07 06:47:13 -07:00
Pavel Begunkov 17add5cea2 io_uring: force multishot CQEs into task context
Multishot are posting CQEs outside of the normal request completion
path, which is usually done from within a task work handler. However, it
might be not the case when it's yet to be polled but has been punted to
io-wq. Make it abide ->task_complete and push it to the polling path
when executed by io-wq.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d7714aaff583096769a0f26e8e747759e556feb1.1670384893.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-07 06:47:13 -07:00
Pavel Begunkov e6aeb2721d io_uring: complete all requests in task context
This patch adds ctx->task_complete flag. If set, we'll complete all
requests in the context of the original task. Note, this extends to
completion CQE posting only but not io_kiocb cleanup / free, e.g. io-wq
may free the requests in the free calllback. This flag will be used
later for optimisations purposes.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/21ece72953f76bb2e77659a72a14326227ab6460.1670384893.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-07 06:47:13 -07:00
Pavel Begunkov 1b346e4aa8 io_uring: don't check overflow flush failures
The only way to fail overflowed CQEs flush is for CQ to be fully packed.
There is one place checking for flush failures, i.e. io_cqring_wait(),
but we limit the number to be waited for by the CQ size, so getting a
failure automatically means that we're done with waiting.

Don't check for failures, rarely but they might spuriously fail CQ
waiting with -EBUSY.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6b720a45c03345655517f8202cbd0bece2848fb2.1670384893.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-07 06:47:13 -07:00
Pavel Begunkov a85381d832 io_uring: skip overflow CQE posting for dying ring
After io_ring_ctx_wait_and_kill() is called there should be no users
poking into rings and so there is no need to post CQEs. So, instead of
trying to post overflowed CQEs into the CQ, drop them. Also, do it
in io_ring_exit_work() in a loop to reduce the number of contexts it
can be executed from and even when it struggles to quiesce the ring we
won't be leaving memory allocated for longer than needed.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/26d13751155a735a3029e24f8d9ca992f810419d.1670384893.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-07 06:47:13 -07:00
Pavel Begunkov 4c979eaefa io_uring: improve io_double_lock_ctx fail handling
msg_ring will fail the request if it can't lock rings, instead punt it
to io-wq as was originally intended.

Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4697f05afcc37df5c8f89e2fe6d9c7c19f0241f9.1670384893.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-07 06:47:13 -07:00
Pavel Begunkov ef0ec1ad03 io_uring: dont remove file from msg_ring reqs
We should not be messing with req->file outside of core paths. Clearing
it makes msg_ring non reentrant, i.e. luckily io_msg_send_fd() fails the
request on failed io_double_lock_ctx() but clearly was originally
intended to do retries instead.

Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e5ac9edadb574fe33f6d727cb8f14ce68262a684.1670384893.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-07 06:47:13 -07:00
Harshit Mogalapalli 998b30c394 io_uring: Fix a null-ptr-deref in io_tctx_exit_cb()
Syzkaller reports a NULL deref bug as follows:

 BUG: KASAN: null-ptr-deref in io_tctx_exit_cb+0x53/0xd3
 Read of size 4 at addr 0000000000000138 by task file1/1955

 CPU: 1 PID: 1955 Comm: file1 Not tainted 6.1.0-rc7-00103-gef4d3ea40565 #75
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
 Call Trace:
  <TASK>
  dump_stack_lvl+0xcd/0x134
  ? io_tctx_exit_cb+0x53/0xd3
  kasan_report+0xbb/0x1f0
  ? io_tctx_exit_cb+0x53/0xd3
  kasan_check_range+0x140/0x190
  io_tctx_exit_cb+0x53/0xd3
  task_work_run+0x164/0x250
  ? task_work_cancel+0x30/0x30
  get_signal+0x1c3/0x2440
  ? lock_downgrade+0x6e0/0x6e0
  ? lock_downgrade+0x6e0/0x6e0
  ? exit_signals+0x8b0/0x8b0
  ? do_raw_read_unlock+0x3b/0x70
  ? do_raw_spin_unlock+0x50/0x230
  arch_do_signal_or_restart+0x82/0x2470
  ? kmem_cache_free+0x260/0x4b0
  ? putname+0xfe/0x140
  ? get_sigframe_size+0x10/0x10
  ? do_execveat_common.isra.0+0x226/0x710
  ? lockdep_hardirqs_on+0x79/0x100
  ? putname+0xfe/0x140
  ? do_execveat_common.isra.0+0x238/0x710
  exit_to_user_mode_prepare+0x15f/0x250
  syscall_exit_to_user_mode+0x19/0x50
  do_syscall_64+0x42/0xb0
  entry_SYSCALL_64_after_hwframe+0x63/0xcd
 RIP: 0023:0x0
 Code: Unable to access opcode bytes at 0xffffffffffffffd6.
 RSP: 002b:00000000fffb7790 EFLAGS: 00000200 ORIG_RAX: 000000000000000b
 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
 RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
  </TASK>
 Kernel panic - not syncing: panic_on_warn set ...

This happens because the adding of task_work from io_ring_exit_work()
isn't synchronized with canceling all work items from eg exec. The
execution of the two are ordered in that they are both run by the task
itself, but if io_tctx_exit_cb() is queued while we're canceling all
work items off exec AND gets executed when the task exits to userspace
rather than in the main loop in io_uring_cancel_generic(), then we can
find current->io_uring == NULL and hit the above crash.

It's safe to add this NULL check here, because the execution of the two
paths are done by the task itself.

Cc: stable@vger.kernel.org
Fixes: d56d938b4b ("io_uring: do ctx initiated file note removal")
Reported-by: syzkaller <syzkaller@googlegroups.com>
Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
Link: https://lore.kernel.org/r/20221206093833.3812138-1-harshit.m.mogalapalli@oracle.com
[axboe: add code comment and also put an explanation in the commit msg]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-07 06:45:20 -07:00
Pavel Begunkov 77e3202a21 io_uring: don't reinstall quiesce node for each tw
There is no need to reinit data and install a new rsrc node every time
we get a task_work, it's detrimental, just execute it and conitnue
waiting.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3895d3344164cd9b3a0bbb24a6e357e20a13434b.1669821213.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-30 10:29:07 -07:00
Pavel Begunkov 0ced756f64 io_uring: improve rsrc quiesce refs checks
Do a little bit of refactoring of io_rsrc_ref_quiesce(), flatten the
data refs checks and so get rid of a conditional weird unlock-else-break
construct.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d21283e9f88a77612c746ed526d86fe3bfb58a70.1669821213.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-30 10:29:07 -07:00
Pavel Begunkov 618d653a34 io_uring: don't raw spin unlock to match cq_lock
There is one newly added place when we lock ring with io_cq_lock() but
unlocking is hand coded calling spin_unlock directly. It's ugly and
troublesome in the long run. Make it consistent with the other completion
locking.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4ca4f0564492b90214a190cd5b2a6c76522de138.1669821213.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-30 10:28:49 -07:00
Pavel Begunkov 443e575506 io_uring: combine poll tw handlers
Merge apoll and regular poll tw handlers, it will help with inlining.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/482e59edb9fc81bd275fdbf486837330fb27120a.1669821213.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-30 10:27:34 -07:00
Pavel Begunkov c3bfb57ea7 io_uring: improve poll warning handling
Don't try to complete requests if their refs are broken and we've got
a warning, it's much better to drop them and potentially leaking than
double freeing.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/31edf9f96f05d03ab62c114508a231a2dce434cb.1669821213.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-30 10:27:22 -07:00
Pavel Begunkov 047b6aef09 io_uring: remove ctx variable in io_poll_check_events
ctx is only used by io_poll_check_events() for multishot poll CQE
posting, don't save it on stack in advance.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/552c1771f8a0e7688afdb4f538ead245f53e80e7.1669821213.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-30 10:26:57 -07:00
Pavel Begunkov 9805fa2d94 io_uring: carve io_poll_check_events fast path
The fast path in io_poll_check_events() is when we have only one
(i.e. master) reference. Move all verification, cancellations
checks, edge case handling and so on under a common if.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8c21c5d5e027e32dc553705e88796dec79ff6f93.1669821213.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-30 10:26:57 -07:00
Pavel Begunkov f6f7f903e7 io_uring: kill io_poll_issue's PF_EXITING check
We don't need to worry about checking PF_EXITING in io_poll_issue().
task works using the function should take care of it and never try to
resubmit / retry if the task is dying.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2e9dc998dc07507c759a0c9cb5d2fbea0710d58c.1669821213.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-30 10:26:57 -07:00
Jens Axboe b2cf789f6c Merge branch 'for-6.2/io_uring' into for-6.2/io_uring-next
* for-6.2/io_uring: (41 commits)
  io_uring: keep unlock_post inlined in hot path
  io_uring: don't use complete_post in kbuf
  io_uring: spelling fix
  io_uring: remove io_req_complete_post_tw
  io_uring: allow multishot polled reqs to defer completion
  io_uring: remove overflow param from io_post_aux_cqe
  io_uring: add lockdep assertion in io_fill_cqe_aux
  io_uring: make io_fill_cqe_aux static
  io_uring: add io_aux_cqe which allows deferred completion
  io_uring: allow defer completion for aux posted cqes
  io_uring: defer all io_req_complete_failed
  io_uring: always lock in io_apoll_task_func
  io_uring: remove iopoll spinlock
  io_uring: iopoll protect complete_post
  io_uring: inline __io_req_complete_put()
  io_uring: remove io_req_tw_post_queue
  io_uring: use io_req_task_complete() in timeout
  io_uring: hold locks for io_req_complete_failed
  io_uring: add completion locking for iopoll
  io_uring: kill io_cqring_ev_posted() and __io_cq_unlock_post()
  ...
2022-11-29 12:08:37 -07:00
Al Viro de4eda9de2 use less confusing names for iov_iter direction initializers
READ/WRITE proved to be actively confusing - the meanings are
"data destination, as used with read(2)" and "data source, as
used with write(2)", but people keep interpreting those as
"we read data from it" and "we write data to it", i.e. exactly
the wrong way.

Call them ITER_DEST and ITER_SOURCE - at least that is harder
to misinterpret...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-11-25 13:01:55 -05:00
Jens Axboe 7cfe7a0948 io_uring: clear TIF_NOTIFY_SIGNAL if set and task_work not available
With how task_work is added and signaled, we can have TIF_NOTIFY_SIGNAL
set and no task_work pending as it got run in a previous loop. Treat
TIF_NOTIFY_SIGNAL like get_signal(), always clear it if set regardless
of whether or not task_work is pending to run.

Cc: stable@vger.kernel.org
Fixes: 46a525e199 ("io_uring: don't gate task_work run on TIF_NOTIFY_SIGNAL")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-25 10:55:08 -07:00
Lin Ma 12ad3d2d6c io_uring/poll: fix poll_refs race with cancelation
There is an interesting race condition of poll_refs which could result
in a NULL pointer dereference. The crash trace is like:

KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
CPU: 0 PID: 30781 Comm: syz-executor.2 Not tainted 6.0.0-g493ffd6605b2 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.13.0-1ubuntu1.1 04/01/2014
RIP: 0010:io_poll_remove_entry io_uring/poll.c:154 [inline]
RIP: 0010:io_poll_remove_entries+0x171/0x5b4 io_uring/poll.c:190
Code: ...
RSP: 0018:ffff88810dfefba0 EFLAGS: 00010202
RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000040000
RDX: ffffc900030c4000 RSI: 000000000003ffff RDI: 0000000000040000
RBP: 0000000000000008 R08: ffffffff9764d3dd R09: fffffbfff3836781
R10: fffffbfff3836781 R11: 0000000000000000 R12: 1ffff11003422d60
R13: ffff88801a116b04 R14: ffff88801a116ac0 R15: dffffc0000000000
FS:  00007f9c07497700(0000) GS:ffff88811a600000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffb5c00ea98 CR3: 0000000105680005 CR4: 0000000000770ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
 <TASK>
 io_apoll_task_func+0x3f/0xa0 io_uring/poll.c:299
 handle_tw_list io_uring/io_uring.c:1037 [inline]
 tctx_task_work+0x37e/0x4f0 io_uring/io_uring.c:1090
 task_work_run+0x13a/0x1b0 kernel/task_work.c:177
 get_signal+0x2402/0x25a0 kernel/signal.c:2635
 arch_do_signal_or_restart+0x3b/0x660 arch/x86/kernel/signal.c:869
 exit_to_user_mode_loop kernel/entry/common.c:166 [inline]
 exit_to_user_mode_prepare+0xc2/0x160 kernel/entry/common.c:201
 __syscall_exit_to_user_mode_work kernel/entry/common.c:283 [inline]
 syscall_exit_to_user_mode+0x58/0x160 kernel/entry/common.c:294
 entry_SYSCALL_64_after_hwframe+0x63/0xcd

The root cause for this is a tiny overlooking in
io_poll_check_events() when cocurrently run with poll cancel routine
io_poll_cancel_req().

The interleaving to trigger use-after-free:

CPU0                                       |  CPU1
                                           |
io_apoll_task_func()                       |  io_poll_cancel_req()
 io_poll_check_events()                    |
  // do while first loop                   |
  v = atomic_read(...)                     |
  // v = poll_refs = 1                     |
  ...                                      |  io_poll_mark_cancelled()
                                           |   atomic_or()
                                           |   // poll_refs =
IO_POLL_CANCEL_FLAG | 1
                                           |
  atomic_sub_return(...)                   |
  // poll_refs = IO_POLL_CANCEL_FLAG       |
  // loop continue                         |
                                           |
                                           |  io_poll_execute()
                                           |   io_poll_get_ownership()
                                           |   // poll_refs =
IO_POLL_CANCEL_FLAG | 1
                                           |   // gets the ownership
  v = atomic_read(...)                     |
  // poll_refs not change                  |
                                           |
  if (v & IO_POLL_CANCEL_FLAG)             |
   return -ECANCELED;                      |
  // io_poll_check_events return           |
  // will go into                          |
  // io_req_complete_failed() free req     |
                                           |
                                           |  io_apoll_task_func()
                                           |  // also go into
io_req_complete_failed()

And the interleaving to trigger the kernel WARNING:

CPU0                                       |  CPU1
                                           |
io_apoll_task_func()                       |  io_poll_cancel_req()
 io_poll_check_events()                    |
  // do while first loop                   |
  v = atomic_read(...)                     |
  // v = poll_refs = 1                     |
  ...                                      |  io_poll_mark_cancelled()
                                           |   atomic_or()
                                           |   // poll_refs =
IO_POLL_CANCEL_FLAG | 1
                                           |
  atomic_sub_return(...)                   |
  // poll_refs = IO_POLL_CANCEL_FLAG       |
  // loop continue                         |
                                           |
  v = atomic_read(...)                     |
  // v = IO_POLL_CANCEL_FLAG               |
                                           |  io_poll_execute()
                                           |   io_poll_get_ownership()
                                           |   // poll_refs =
IO_POLL_CANCEL_FLAG | 1
                                           |   // gets the ownership
                                           |
  WARN_ON_ONCE(!(v & IO_POLL_REF_MASK)))   |
  // v & IO_POLL_REF_MASK = 0 WARN         |
                                           |
                                           |  io_apoll_task_func()
                                           |  // also go into
io_req_complete_failed()

By looking up the source code and communicating with Pavel, the
implementation of this atomic poll refs should continue the loop of
io_poll_check_events() just to avoid somewhere else to grab the
ownership. Therefore, this patch simply adds another AND operation to
make sure the loop will stop if it finds the poll_refs is exactly equal
to IO_POLL_CANCEL_FLAG. Since io_poll_cancel_req() grabs ownership and
will finally make its way to io_req_complete_failed(), the req will
be reclaimed as expected.

Fixes: aa43477b04 ("io_uring: poll rework")
Signed-off-by: Lin Ma <linma@zju.edu.cn>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
[axboe: tweak description and code style]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-25 07:17:33 -07:00
Lin Ma 9d94c04c0d io_uring/filetable: fix file reference underflow
There is an interesting reference bug when -ENOMEM occurs in calling of
io_install_fixed_file(). KASan report like below:

[   14.057131] ==================================================================
[   14.059161] BUG: KASAN: use-after-free in unix_get_socket+0x10/0x90
[   14.060975] Read of size 8 at addr ffff88800b09cf20 by task kworker/u8:2/45
[   14.062684]
[   14.062768] CPU: 2 PID: 45 Comm: kworker/u8:2 Not tainted 6.1.0-rc4 #1
[   14.063099] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[   14.063666] Workqueue: events_unbound io_ring_exit_work
[   14.063936] Call Trace:
[   14.064065]  <TASK>
[   14.064175]  dump_stack_lvl+0x34/0x48
[   14.064360]  print_report+0x172/0x475
[   14.064547]  ? _raw_spin_lock_irq+0x83/0xe0
[   14.064758]  ? __virt_addr_valid+0xef/0x170
[   14.064975]  ? unix_get_socket+0x10/0x90
[   14.065167]  kasan_report+0xad/0x130
[   14.065353]  ? unix_get_socket+0x10/0x90
[   14.065553]  unix_get_socket+0x10/0x90
[   14.065744]  __io_sqe_files_unregister+0x87/0x1e0
[   14.065989]  ? io_rsrc_refs_drop+0x1c/0xd0
[   14.066199]  io_ring_exit_work+0x388/0x6a5
[   14.066410]  ? io_uring_try_cancel_requests+0x5bf/0x5bf
[   14.066674]  ? try_to_wake_up+0xdb/0x910
[   14.066873]  ? virt_to_head_page+0xbe/0xbe
[   14.067080]  ? __schedule+0x574/0xd20
[   14.067273]  ? read_word_at_a_time+0xe/0x20
[   14.067492]  ? strscpy+0xb5/0x190
[   14.067665]  process_one_work+0x423/0x710
[   14.067879]  worker_thread+0x2a2/0x6f0
[   14.068073]  ? process_one_work+0x710/0x710
[   14.068284]  kthread+0x163/0x1a0
[   14.068454]  ? kthread_complete_and_exit+0x20/0x20
[   14.068697]  ret_from_fork+0x22/0x30
[   14.068886]  </TASK>
[   14.069000]
[   14.069088] Allocated by task 289:
[   14.069269]  kasan_save_stack+0x1e/0x40
[   14.069463]  kasan_set_track+0x21/0x30
[   14.069652]  __kasan_slab_alloc+0x58/0x70
[   14.069899]  kmem_cache_alloc+0xc5/0x200
[   14.070100]  __alloc_file+0x20/0x160
[   14.070283]  alloc_empty_file+0x3b/0xc0
[   14.070479]  path_openat+0xc3/0x1770
[   14.070689]  do_filp_open+0x150/0x270
[   14.070888]  do_sys_openat2+0x113/0x270
[   14.071081]  __x64_sys_openat+0xc8/0x140
[   14.071283]  do_syscall_64+0x3b/0x90
[   14.071466]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[   14.071791]
[   14.071874] Freed by task 0:
[   14.072027]  kasan_save_stack+0x1e/0x40
[   14.072224]  kasan_set_track+0x21/0x30
[   14.072415]  kasan_save_free_info+0x2a/0x50
[   14.072627]  __kasan_slab_free+0x106/0x190
[   14.072858]  kmem_cache_free+0x98/0x340
[   14.073075]  rcu_core+0x427/0xe50
[   14.073249]  __do_softirq+0x110/0x3cd
[   14.073440]
[   14.073523] Last potentially related work creation:
[   14.073801]  kasan_save_stack+0x1e/0x40
[   14.074017]  __kasan_record_aux_stack+0x97/0xb0
[   14.074264]  call_rcu+0x41/0x550
[   14.074436]  task_work_run+0xf4/0x170
[   14.074619]  exit_to_user_mode_prepare+0x113/0x120
[   14.074858]  syscall_exit_to_user_mode+0x1d/0x40
[   14.075092]  do_syscall_64+0x48/0x90
[   14.075272]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[   14.075529]
[   14.075612] Second to last potentially related work creation:
[   14.075900]  kasan_save_stack+0x1e/0x40
[   14.076098]  __kasan_record_aux_stack+0x97/0xb0
[   14.076325]  task_work_add+0x72/0x1b0
[   14.076512]  fput+0x65/0xc0
[   14.076657]  filp_close+0x8e/0xa0
[   14.076825]  __x64_sys_close+0x15/0x50
[   14.077019]  do_syscall_64+0x3b/0x90
[   14.077199]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[   14.077448]
[   14.077530] The buggy address belongs to the object at ffff88800b09cf00
[   14.077530]  which belongs to the cache filp of size 232
[   14.078105] The buggy address is located 32 bytes inside of
[   14.078105]  232-byte region [ffff88800b09cf00, ffff88800b09cfe8)
[   14.078685]
[   14.078771] The buggy address belongs to the physical page:
[   14.079046] page:000000001bd520e7 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88800b09de00 pfn:0xb09c
[   14.079575] head:000000001bd520e7 order:1 compound_mapcount:0 compound_pincount:0
[   14.079946] flags: 0x100000000010200(slab|head|node=0|zone=1)
[   14.080244] raw: 0100000000010200 0000000000000000 dead000000000001 ffff88800493cc80
[   14.080629] raw: ffff88800b09de00 0000000080190018 00000001ffffffff 0000000000000000
[   14.081016] page dumped because: kasan: bad access detected
[   14.081293]
[   14.081376] Memory state around the buggy address:
[   14.081618]  ffff88800b09ce00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[   14.081974]  ffff88800b09ce80: 00 00 00 00 00 fc fc fc fc fc fc fc fc fc fc fc
[   14.082336] >ffff88800b09cf00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[   14.082690]                                ^
[   14.082909]  ffff88800b09cf80: fb fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc
[   14.083266]  ffff88800b09d000: fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb fb
[   14.083622] ==================================================================

The actual tracing of this bug is shown below:

commit 8c71fe7502 ("io_uring: ensure fput() called correspondingly
when direct install fails") adds an additional fput() in
io_fixed_fd_install() when io_file_bitmap_get() returns error values. In
that case, the routine will never make it to io_install_fixed_file() due
to an early return.

static int io_fixed_fd_install(...)
{
  if (alloc_slot) {
    ...
    ret = io_file_bitmap_get(ctx);
    if (unlikely(ret < 0)) {
      io_ring_submit_unlock(ctx, issue_flags);
      fput(file);
      return ret;
    }
    ...
  }
  ...
  ret = io_install_fixed_file(req, file, issue_flags, file_slot);
  ...
}

In the above scenario, the reference is okay as io_fixed_fd_install()
ensures the fput() is called when something bad happens, either via
bitmap or via inner io_install_fixed_file().

However, the commit 61c1b44a21 ("io_uring: fix deadlock on iowq file
slot alloc") breaks the balance because it places fput() into the common
path for both io_file_bitmap_get() and io_install_fixed_file(). Since
io_install_fixed_file() handles the fput() itself, the reference
underflow come across then.

There are some extra commits make the current code into
io_fixed_fd_install() -> __io_fixed_fd_install() ->
io_install_fixed_file()

However, the fact that there is an extra fput() is called if
io_install_fixed_file() calls fput(). Traversing through the code, I
find that the existing two callers to __io_fixed_fd_install():
io_fixed_fd_install() and io_msg_send_fd() have fput() when handling
error return, this patch simply removes the fput() in
io_install_fixed_file() to fix the bug.

Fixes: 61c1b44a21 ("io_uring: fix deadlock on iowq file slot alloc")
Signed-off-by: Lin Ma <linma@zju.edu.cn>
Link: https://lore.kernel.org/r/be4ba4b.5d44.184a0a406a4.Coremail.linma@zju.edu.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-25 06:54:46 -07:00
Pavel Begunkov a26a35e901 io_uring: make poll refs more robust
poll_refs carry two functions, the first is ownership over the request.
The second is notifying the io_poll_check_events() that there was an
event but wake up couldn't grab the ownership, so io_poll_check_events()
should retry.

We want to make poll_refs more robust against overflows. Instead of
always incrementing it, which covers two purposes with one atomic, check
if poll_refs is elevated enough and if so set a retry flag without
attempts to grab ownership. The gap between the bias check and following
atomics may seem racy, but we don't need it to be strict. Moreover there
might only be maximum 4 parallel updates: by the first and the second
poll entries, __io_arm_poll_handler() and cancellation. From those four,
only poll wake ups may be executed multiple times, but they're protected
by a spin.

Cc: stable@vger.kernel.org
Reported-by: Lin Ma <linma@zju.edu.cn>
Fixes: aa43477b04 ("io_uring: poll rework")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c762bc31f8683b3270f3587691348a7119ef9c9d.1668963050.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-25 06:54:46 -07:00
Pavel Begunkov 2f3893437a io_uring: cmpxchg for poll arm refs release
Replace atomically substracting the ownership reference at the end of
arming a poll with a cmpxchg. We try to release ownership by setting 0
assuming that poll_refs didn't change while we were arming. If it did
change, we keep the ownership and use it to queue a tw, which is fully
capable to process all events and (even tolerates spurious wake ups).

It's a bit more elegant as we reduce races b/w setting the cancellation
flag and getting refs with this release, and with that we don't have to
worry about any kinds of underflows. It's not the fastest path for
polling. The performance difference b/w cmpxchg and atomic dec is
usually negligible and it's not the fastest path.

Cc: stable@vger.kernel.org
Fixes: aa43477b04 ("io_uring: poll rework")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0c95251624397ea6def568ff040cad2d7926fd51.1668963050.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-25 06:54:16 -07:00
Pavel Begunkov 5d77291685 io_uring: keep unlock_post inlined in hot path
This partially reverts

6c16fe3c16 ("io_uring: kill io_cqring_ev_posted() and __io_cq_unlock_post()")

The redundancy of __io_cq_unlock_post() was always to keep it inlined
into __io_submit_flush_completions(). Inline it back and rename with
hope of clarifying the intention behind it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/372a16c485fca44c069be2e92fc5e7332a1d7fd7.1669310258.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-25 06:11:15 -07:00
Pavel Begunkov c3b490930d io_uring: don't use complete_post in kbuf
Now we're handling IOPOLL completions more generically, get rid uses of
_post() and send requests through the normal path. It may have some
extra mertis performance wise, but we don't care much as there is a
better interface for selected buffers.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4deded706587f55b006dc33adf0c13cfc3b2319f.1669310258.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-25 06:11:15 -07:00
Dylan Yudaken 10d8bc3541 io_uring: spelling fix
s/pushs/pushes/

Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20221125103412.1425305-3-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-25 06:10:46 -07:00
Dylan Yudaken 27f35fe909 io_uring: remove io_req_complete_post_tw
It's only used in one place. Inline it.

Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20221125103412.1425305-2-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-25 06:10:46 -07:00
Dylan Yudaken 9a6924519e io_uring: allow multishot polled reqs to defer completion
Until now there was no reason for multishot polled requests to defer
completions as there was no functional difference. However now this will
actually defer the completions, for a performance win.

Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20221124093559.3780686-10-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-25 06:10:04 -07:00
Dylan Yudaken b529c96a89 io_uring: remove overflow param from io_post_aux_cqe
The only call sites which would not allow overflow are also call sites
which would use the io_aux_cqe as they care about ordering.

So remove this parameter from io_post_aux_cqe.

Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20221124093559.3780686-9-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-25 06:10:04 -07:00
Dylan Yudaken 2e2ef4a1da io_uring: add lockdep assertion in io_fill_cqe_aux
Add an assertion for the completion lock to io_fill_cqe_aux

Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20221124093559.3780686-8-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-25 06:10:04 -07:00
Dylan Yudaken a77ab745f2 io_uring: make io_fill_cqe_aux static
This is only used in io_uring.c

Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20221124093559.3780686-7-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-25 06:10:04 -07:00
Dylan Yudaken 9b8c54755a io_uring: add io_aux_cqe which allows deferred completion
Use the just introduced deferred post cqe completion state when possible
in io_aux_cqe. If not possible fallback to io_post_aux_cqe.

This introduces a complication because of allow_overflow. For deferred
completions we cannot know without locking the completion_lock if it will
overflow (and even if we locked it, another post could sneak in and cause
this cqe to be in overflow).
However since overflow protection is mostly a best effort defence in depth
to prevent infinite loops of CQEs for poll, just checking the overflow bit
is going to be good enough and will result in at most 16 (array size of
deferred cqes) overflows.

Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20221124093559.3780686-6-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-25 06:10:04 -07:00
Dylan Yudaken 931147ddfa io_uring: allow defer completion for aux posted cqes
Multishot ops cannot use the compl_reqs list as the request must stay in
the poll list, but that means they need to run each completion without
benefiting from batching.

Here introduce batching infrastructure for only small (ie 16 byte)
CQEs. This restriction is ok because there are no use cases posting 32
byte CQEs.

In the ring keep a batch of up to 16 posted results, and flush in the same
way as compl_reqs.

16 was chosen through experimentation on a microbenchmark ([1]), as well
as trying not to increase the size of the ring too much. This increases
the size to 1472 bytes from 1216.

[1]: 9ac66b36bc
Run with $ make -j && ./benchmark/reg.b -s 1 -t 2000 -r 10
Gives results:
baseline	8309 k/s
8		18807 k/s
16		19338 k/s
32		20134 k/s

Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20221124093559.3780686-5-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-25 06:10:04 -07:00
Dylan Yudaken 973fc83f3a io_uring: defer all io_req_complete_failed
All failures happen under lock now, and can be deferred. To be consistent
when the failure has happened after some multishot cqe has been
deferred (and keep ordering), always defer failures.

To make this obvious at the caller (and to help prevent a future bug)
rename io_req_complete_failed to io_req_defer_failed.

Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20221124093559.3780686-4-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-25 06:10:04 -07:00
Dylan Yudaken c06c6c5d27 io_uring: always lock in io_apoll_task_func
This is required for the failure case (io_req_complete_failed) and is
missing.

The alternative would be to only lock in the failure path, however all of
the non-error paths in io_poll_check_events that do not do not return
IOU_POLL_NO_ACTION end up locking anyway. The only extraneous lock would
be for the multishot poll overflowing the CQE ring, however multishot poll
would probably benefit from being locked as it will allow completions to
be batched.

So it seems reasonable to lock always.

Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20221124093559.3780686-3-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-25 06:10:04 -07:00
Pavel Begunkov 2dac1a1592 io_uring: remove iopoll spinlock
This reverts commit 2ccc92f4ef

io_req_complete_post() should now behave well even in case of IOPOLL, we
can remove completion_lock locking.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7e171c8b530656b14a671c59100ca260e46e7f2a.1669203009.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-23 10:47:07 -07:00
Pavel Begunkov 1bec951c38 io_uring: iopoll protect complete_post
io_req_complete_post() may be used by iopoll enabled rings, grab locks
in this case. That requires to pass issue_flags to propagate the locking
state.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/cc6d854065c57c838ca8e8806f707a226b70fd2d.1669203009.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-23 10:45:31 -07:00
Pavel Begunkov fa18fa2272 io_uring: inline __io_req_complete_put()
Inline __io_req_complete_put() into io_req_complete_post(), there are no
other users.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1923a4dfe80fa877f859a22ed3df2d5fc8ecf02b.1669203009.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-23 10:44:01 -07:00
Pavel Begunkov 833b5dfffc io_uring: remove io_req_tw_post_queue
Remove io_req_tw_post() and io_req_tw_post_queue(), we can use
io_req_task_complete() instead.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b9b73c08022c7f1457023ac841f35c0100e70345.1669203009.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-23 10:44:00 -07:00
Pavel Begunkov 624fd779fd io_uring: use io_req_task_complete() in timeout
Use a more generic io_req_task_complete() in timeout completion
task_work instead of io_req_complete_post().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/bda1710b58c07bf06107421c2a65c529ea9cdcac.1669203009.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-23 10:44:00 -07:00
Pavel Begunkov e276ae344a io_uring: hold locks for io_req_complete_failed
A preparation patch, make sure we always hold uring_lock around
io_req_complete_failed(). The only place deviating from the rule
is io_cancel_defer_files(), queue a tw instead.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/70760344eadaecf2939287084b9d4ba5c05a6984.1669203009.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-23 10:44:00 -07:00
Pavel Begunkov 2ccc92f4ef io_uring: add completion locking for iopoll
There are pieces of code that may allow iopoll to race filling cqes,
temporarily add spinlocking around posting events.

Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/84d86b5c117feda075471c5c9e65208e0dccf5d0.1669203009.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-23 10:44:00 -07:00
Jens Axboe 6c16fe3c16 io_uring: kill io_cqring_ev_posted() and __io_cq_unlock_post()
__io_cq_unlock_post() is identical to io_cq_unlock_post(), and
io_cqring_ev_posted() has a single caller so migth as well just inline
it there.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-22 06:09:30 -07:00
Jens Axboe 4061f0ef73 Revert "io_uring: disallow self-propelled ring polling"
This reverts commit 7fdbc5f014.

This patch dealt with a subset of the real problem, which is a potential
circular dependency on the wakup path for io_uring itself. Outside of
io_uring, eventfd can also trigger this (see details in 03e02acda8)
and so can epoll (see details in caf1aeaffc). Now that we have a
generic solution to this problem, get rid of the io_uring specific
work-around.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-22 06:08:31 -07:00
Jens Axboe 4464853277 io_uring: pass in EPOLL_URING_WAKE for eventfd signaling and wakeups
Pass in EPOLL_URING_WAKE when signaling eventfd or doing poll related
wakups, so that we can check for a circular event dependency between
eventfd and epoll. If this flag is set when our wakeup handlers are
called, then we know we have a dependency that needs to terminate
multishot requests.

eventfd and epoll are the only such possible dependencies.

Cc: stable@vger.kernel.org # 6.0
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-22 06:08:31 -07:00
Pavel Begunkov f9d567c75e io_uring: inline __io_req_complete_post()
There is only one user of __io_req_complete_post(), inline it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ef4c9059950a3da5cf68df00f977f1fd13bd9306.1668597569.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-21 07:45:19 -07:00
Pavel Begunkov d759360620 io_uring: split tw fallback into a function
When the target process is dying and so task_work_add() is not allowed
we push all task_work item to the fallback workqueue. Move the part
responsible for moving tw items out of __io_req_task_work_add() into
a separate function. Makes it a bit cleaner and gives the compiler a bit
of extra info.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e503dab9d7af95470ca6b214c6de17715ae4e748.1668162751.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-21 07:44:21 -07:00
Pavel Begunkov e52d2e583e io_uring: inline io_req_task_work_add()
__io_req_task_work_add() is huge but marked inline, that makes compilers
to generate lots of garbage. Inline the wrapper caller
io_req_task_work_add() instead.

before and after:
   text    data     bss     dec     hex filename
  47347   16248       8   63603    f873 io_uring/io_uring.o
   text    data     bss     dec     hex filename
  45303   16248       8   61559    f077 io_uring/io_uring.o

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/26dc8c28ca0160e3269ef3e55c5a8b917c4d4450.1668162751.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-21 07:44:18 -07:00
Lin Ma 23a6c9ac4d io_uring: update outdated comment of callbacks
Previous commit ebc11b6c6b ("io_uring: clean io-wq callbacks") rename
io_free_work() into io_wq_free_work() for consistency. This patch also
updates relevant comment to avoid misunderstanding.

Fixes: ebc11b6c6b ("io_uring: clean io-wq callbacks")
Signed-off-by: Lin Ma <linma@zju.edu.cn>
Link: https://lore.kernel.org/r/20221110122103.20120-1-linma@zju.edu.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-21 07:44:16 -07:00
Lin Ma cd42a53d25 io_uring/poll: remove outdated comments of caching
Previous commit 13a99017ff ("io_uring: remove events caching
atavisms") entirely removes the events caching optimization introduced
by commit 81459350d5 ("io_uring: cache req->apoll->events in
req->cflags"). Hence the related comment should also be removed to avoid
misunderstanding.

Fixes: 13a99017ff ("io_uring: remove events caching atavisms")
Signed-off-by: Lin Ma <linma@zju.edu.cn>
Link: https://lore.kernel.org/r/20221110060313.16303-1-linma@zju.edu.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-21 07:44:14 -07:00
Dylan Yudaken e2ad599d1e io_uring: allow multishot recv CQEs to overflow
With commit aa1df3a360 ("io_uring: fix CQE reordering"), there are
stronger guarantees for overflow ordering. Specifically ensuring that
userspace will not receive out of order receive CQEs. Therefore this is
not needed any more for recv/recvmsg.

Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20221107125236.260132-4-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-21 07:44:09 -07:00
Dylan Yudaken 515e269612 io_uring: revert "io_uring fix multishot accept ordering"
This is no longer needed after commit aa1df3a360 ("io_uring: fix CQE
reordering"), since all reordering is now taken care of.

This reverts commit cbd2574854 ("io_uring: fix multishot accept
ordering").

Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20221107125236.260132-2-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-21 07:43:42 -07:00
Dylan Yudaken ef67fcb41d io_uring: do not always force run task_work in io_uring_register
Running task work when not needed can unnecessarily delay
operations. Specifically IORING_SETUP_DEFER_TASKRUN tries to avoid running
task work until the user requests it. Therefore do not run it in
io_uring_register any more.

The one catch is that io_rsrc_ref_quiesce expects it to have run in order
to process all outstanding references, and so reorder it's loop to do this.

Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20221107123349.4106213-1-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-21 07:38:54 -07:00
Xinghui Li df730ec21f io_uring: fix two assignments in if conditions
Fixes two errors:

"ERROR: do not use assignment in if condition
130: FILE: io_uring/net.c:130:
+       if (!(issue_flags & IO_URING_F_UNLOCKED) &&

ERROR: do not use assignment in if condition
599: FILE: io_uring/poll.c:599:
+       } else if (!(issue_flags & IO_URING_F_UNLOCKED) &&"
reported by checkpatch.pl in net.c and poll.c .

Signed-off-by: Xinghui Li <korantli@tencent.com>
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/r/20221102082503.32236-1-korantwork@gmail.com
[axboe: style tweaks]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-21 07:38:31 -07:00
Pavel Begunkov 42385b02ba io_uring/net: move mm accounting to a slower path
We can also move mm accounting to the extended callbacks. It removes a
few cycles from the hot path including skipping one function call and
setting io_req_task_complete as a callback directly. For user backed I/O
it shouldn't make any difference taking into considering atomic mm
accounting and page pinning.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1062f270273ad11c1b7b45ec59a6a317533d5e64.1667557923.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-21 07:38:31 -07:00
Pavel Begunkov 40725d1b96 io_uring: move zc reporting from the hot path
Add custom tw and notif callbacks on top of usual bits also handling zc
reporting. That moves it from the hot path.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/40de4a6409042478e1f35adc4912e23226cb1b5c.1667557923.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-21 07:38:31 -07:00
Pavel Begunkov bedd20bcf3 io_uring/net: inline io_notif_flush()
io_notif_flush() is pretty simple, we can inline it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/332359e7bd124138dfe51340bbec829c9b265c18.1667557923.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-21 07:38:31 -07:00
Pavel Begunkov 7fa8e84192 io_uring/net: rename io_uring_tx_zerocopy_callback
Just a simple renaming patch, io_uring_tx_zerocopy_callback() is too
bulky and doesn't follow usual naming style.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/24d78325403ca6dcb1ec4bced1e33cacc9b832a5.1667557923.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-21 07:38:31 -07:00
Pavel Begunkov fc1dd0d4fa io_uring/net: preset notif tw handler
We're going to have multiple notification tw functions. In preparation
for future changes default the tw callback in advance so later we can
replace it with other versions.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7acdbea5e20eadd844513320cd454af14ba50f64.1667557923.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-21 07:38:31 -07:00
Pavel Begunkov 5bc8e8884b io_uring/net: remove extra notif rsrc setup
io_send_zc_prep() sets up notification's rsrc_node when needed, don't
unconditionally install it on notif alloc.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/dbe4875ac33e180b9799d8537a5e27935e82aac4.1667557923.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-21 07:38:31 -07:00
Pavel Begunkov 3671163beb io_uring: move kbuf put out of generic tw complete
There are multiple users of io_req_task_complete() including zc
notifications, but only read requests use selected buffers. As we
already have an rw specific tw function, move io_put_kbuf() in there.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/94374c7649aaefc3a17808dc4701f25ccd457e25.1667557923.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-21 07:38:31 -07:00
Stefan Metzmacher e307e66981 io_uring/net: introduce IORING_SEND_ZC_REPORT_USAGE flag
It might be useful for applications to detect if a zero copy transfer with
SEND[MSG]_ZC was actually possible or not. The application can fallback to
plain SEND[MSG] in order to avoid the overhead of two cqes per request. Or
it can generate a log message that could indicate to an administrator that
no zero copy was possible and could explain degraded performance.

Cc: stable@vger.kernel.org # 6.1
Link: https://lore.kernel.org/io-uring/fb6a7599-8a9b-15e5-9b64-6cd9d01c6ff4@gmail.com/T/#m2b0d9df94ce43b0e69e6c089bdff0ce6babbdfaa
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8945b01756d902f5d5b0667f20b957ad3f742e5e.1666895626.git.metze@samba.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-21 07:38:31 -07:00
Pavel Begunkov 7fdbc5f014 io_uring: disallow self-propelled ring polling
When we post a CQE we wake all ring pollers as it normally should be.
However, if a CQE was generated by a multishot poll request targeting
its own ring, it'll wake that request up, which will make it to post
a new CQE, which will wake the request and so on until it exhausts all
CQ entries.

Don't allow multishot polling io_uring files but downgrade them to
oneshots, which was always stated as a correct behaviour that the
userspace should check for.

Cc: stable@vger.kernel.org
Fixes: aa43477b04 ("io_uring: poll rework")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3124038c0e7474d427538c2d915335ec28c92d21.1668785722.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-18 09:29:31 -07:00
Pavel Begunkov 100d6b17c0 io_uring: fix multishot recv request leaks
Having REQ_F_POLLED set doesn't guarantee that the request is
executed as a multishot from the polling path. Fortunately for us, if
the code thinks it's multishot issue when it's not, it can only ask to
skip completion so leaking the request. Use issue_flags to mark
multipoll issues.

Cc: stable@vger.kernel.org
Fixes: 1300ebb20286b ("io_uring: multishot recv")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/37762040ba9c52b81b92a2f5ebfd4ee484088951.1668710222.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-17 12:33:33 -07:00
Pavel Begunkov 9148286476 io_uring: fix multishot accept request leaks
Having REQ_F_POLLED set doesn't guarantee that the request is
executed as a multishot from the polling path. Fortunately for us, if
the code thinks it's multishot issue when it's not, it can only ask to
skip completion so leaking the request. Use issue_flags to mark
multipoll issues.

Cc: stable@vger.kernel.org
Fixes: 390ed29b5e ("io_uring: add IORING_ACCEPT_MULTISHOT for accept")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7700ac57653f2823e30b34dc74da68678c0c5f13.1668710222.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-17 12:33:33 -07:00
Pavel Begunkov 539bcb57da io_uring: fix tw losing poll events
We may never try to process a poll wake and its mask if there was
multiple wake ups racing for queueing up a tw. Force
io_poll_check_events() to update the mask by vfs_poll().

Cc: stable@vger.kernel.org
Fixes: aa43477b04 ("io_uring: poll rework")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/00344d60f8b18907171178d7cf598de71d127b0b.1668710222.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-17 12:33:33 -07:00
Pavel Begunkov b98186aee2 io_uring: update res mask in io_poll_check_events
When io_poll_check_events() collides with someone attempting to queue a
task work, it'll spin for one more time. However, it'll continue to use
the mask from the first iteration instead of updating it. For example,
if the first wake up was a EPOLLIN and the second EPOLLOUT, the
userspace will not get EPOLLOUT in time.

Clear the mask for all subsequent iterations to force vfs_poll().

Cc: stable@vger.kernel.org
Fixes: aa43477b04 ("io_uring: poll rework")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2dac97e8f691231049cb259c4ae57e79e40b537c.1668710222.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-17 12:33:33 -07:00
Pavel Begunkov 12e4e8c7ab io_uring/rw: enable bio caches for IRQ rw
Now we can use IOCB_ALLOC_CACHE not only for iopoll'ed reads/write but
also for normal IRQ driven I/O.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/fb8bd092ed5a4a3b037e84e4777074d07aa5639a.1667384020.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-16 09:44:27 -07:00
Pavel Begunkov 5576035f15 io_uring/poll: lockdep annote io_poll_req_insert_locked
Add a lockdep annotation in io_poll_req_insert_locked().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8115d8e702733754d0aea119e9b5bb63d1eb8b24.1668184658.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-11 09:59:27 -07:00
Pavel Begunkov 30a33669fa io_uring/poll: fix double poll req->flags races
io_poll_double_prepare()            | io_poll_wake()
                                    | poll->head = NULL
smp_load(&poll->head); /* NULL */   |
flags = req->flags;                 |
                                    | req->flags &= ~SINGLE_POLL;
req->flags = flags | DOUBLE_POLL    |

The idea behind io_poll_double_prepare() is to serialise with the
first poll entry by taking the wq lock. However, it's not safe to assume
that io_poll_wake() is not running when we can't grab the lock and so we
may race modifying req->flags.

Skip double poll setup if that happens. It's ok because the first poll
entry will only be removed when it's definitely completing, e.g.
pollfree or oneshot with a valid mask.

Fixes: 49f1c68e04 ("io_uring: optimise submission side poll_refs")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b7fab2d502f6121a7d7b199fe4d914a43ca9cdfd.1668184658.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-11 09:59:27 -07:00
Jens Axboe 3851d25c75 io_uring: check for rollover of buffer ID when providing buffers
We already check if the chosen starting offset for the buffer IDs fit
within an unsigned short, as 65535 is the maximum value for a provided
buffer. But if the caller asks to add N buffers at offset M, and M + N
would exceed the size of the unsigned short, we simply add buffers with
wrapping around the ID.

This is not necessarily a bug and could in fact be a valid use case, but
it seems confusing and inconsistent with the initial check for starting
offset. Let's check for wrap consistently, and error the addition if we
do need to wrap.

Reported-by: Olivier Langlois <olivier@trillion01.com>
Link: https://github.com/axboe/liburing/issues/726
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-10 11:07:41 -07:00
Dylan Yudaken 0fc8c2acbf io_uring: calculate CQEs from the user visible value
io_cqring_wait (and it's wake function io_has_work) used cached_cq_tail in
order to calculate the number of CQEs. cached_cq_tail is set strictly
before the user visible rings->cq.tail

However as far as userspace is concerned,  if io_uring_enter(2) is called
with a minimum number of events, they will verify by checking
rings->cq.tail.

It is therefore possible for io_uring_enter(2) to return early with fewer
events visible to the user.

Instead make the wait functions read from the user visible value, so there
will be no discrepency.

This is triggered eventually by the following reproducer:

struct io_uring_sqe *sqe;
struct io_uring_cqe *cqe;
unsigned int cqe_ready;
struct io_uring ring;
int ret, i;

ret = io_uring_queue_init(N, &ring, 0);
assert(!ret);
while(true) {
	for (i = 0; i < N; i++) {
		sqe = io_uring_get_sqe(&ring);
		io_uring_prep_nop(sqe);
		sqe->flags |= IOSQE_ASYNC;
	}
	ret = io_uring_submit(&ring);
	assert(ret == N);

	do {
		ret = io_uring_wait_cqes(&ring, &cqe, N, NULL, NULL);
	} while(ret == -EINTR);
	cqe_ready = io_uring_cq_ready(&ring);
	assert(!ret);
	assert(cqe_ready == N);
	io_uring_cq_advance(&ring, N);
}

Fixes: ad3eb2c89f ("io_uring: split overflow state into SQ and CQ side")
Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20221108153016.1854297-1-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-08 10:36:15 -07:00
Christian Brauner 5a6f52d20c
acl: conver higher-level helpers to rely on mnt_idmap
Convert an initial portion to rely on struct mnt_idmap by converting the
high level xattr helpers.

Reviewed-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2022-10-31 17:48:12 +01:00
Dylan Yudaken b3026767e1 io_uring: unlock if __io_run_local_work locked inside
It is possible for tw to lock the ring, and this was not propogated out to
io_run_local_work. This can cause an unlock to be missed.

Instead pass a pointer to locked into __io_run_local_work.

Fixes: 8ac5d85a89 ("io_uring: add local task_work run helper that is entered locked")
Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20221027144429.3971400-3-dylany@meta.com
[axboe: WARN_ON() -> WARN_ON_ONCE() and add a minor comment]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-27 09:52:12 -06:00
Dylan Yudaken 8de11cdc96 io_uring: use io_run_local_work_locked helper
prefer to use io_run_local_work_locked helper for consistency

Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20221027144429.3971400-2-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-27 09:51:47 -06:00
Pavel Begunkov cc767e7c69 io_uring/net: fail zc sendmsg when unsupported by socket
The previous patch fails zerocopy send requests for protocols that don't
support it, do the same for zerocopy sendmsg.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0854e7bb4c3d810a48ec8b5853e2f61af36a0467.1666346426.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-22 08:43:03 -06:00
Pavel Begunkov edf8143879 io_uring/net: fail zc send when unsupported by socket
If a protocol doesn't support zerocopy it will silently fall back to
copying. This type of behaviour has always been a source of troubles
so it's better to fail such requests instead.

Cc: <stable@vger.kernel.org> # 6.0
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2db3c7f16bb6efab4b04569cd16e6242b40c5cb3.1666346426.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-22 08:43:03 -06:00
Rafael Mendonca 996d3efeb0 io-wq: Fix memory leak in worker creation
If the CPU mask allocation for a node fails, then the memory allocated for
the 'io_wqe' struct of the current node doesn't get freed on the error
handling path, since it has not yet been added to the 'wqes' array.

This was spotted when fuzzing v6.1-rc1 with Syzkaller:
BUG: memory leak
unreferenced object 0xffff8880093d5000 (size 1024):
  comm "syz-executor.2", pid 7701, jiffies 4295048595 (age 13.900s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<00000000cb463369>] __kmem_cache_alloc_node+0x18e/0x720
    [<00000000147a3f9c>] kmalloc_node_trace+0x2a/0x130
    [<000000004e107011>] io_wq_create+0x7b9/0xdc0
    [<00000000c38b2018>] io_uring_alloc_task_context+0x31e/0x59d
    [<00000000867399da>] __io_uring_add_tctx_node.cold+0x19/0x1ba
    [<000000007e0e7a79>] io_uring_setup.cold+0x1b80/0x1dce
    [<00000000b545e9f6>] __x64_sys_io_uring_setup+0x5d/0x80
    [<000000008a8a7508>] do_syscall_64+0x5d/0x90
    [<000000004ac08bec>] entry_SYSCALL_64_after_hwframe+0x63/0xcd

Fixes: 0e03496d19 ("io-wq: use private CPU mask")
Cc: stable@vger.kernel.org
Signed-off-by: Rafael Mendonca <rafaelmendsr@gmail.com>
Link: https://lore.kernel.org/r/20221020014710.902201-1-rafaelmendsr@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-20 05:48:59 -07:00
Harshit Mogalapalli 16bbdfe5fb io_uring/msg_ring: Fix NULL pointer dereference in io_msg_send_fd()
Syzkaller produced the below call trace:

 BUG: KASAN: null-ptr-deref in io_msg_ring+0x3cb/0x9f0
 Write of size 8 at addr 0000000000000070 by task repro/16399

 CPU: 0 PID: 16399 Comm: repro Not tainted 6.1.0-rc1 #28
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7
 Call Trace:
  <TASK>
  dump_stack_lvl+0xcd/0x134
  ? io_msg_ring+0x3cb/0x9f0
  kasan_report+0xbc/0xf0
  ? io_msg_ring+0x3cb/0x9f0
  kasan_check_range+0x140/0x190
  io_msg_ring+0x3cb/0x9f0
  ? io_msg_ring_prep+0x300/0x300
  io_issue_sqe+0x698/0xca0
  io_submit_sqes+0x92f/0x1c30
  __do_sys_io_uring_enter+0xae4/0x24b0
....
 RIP: 0033:0x7f2eaf8f8289
 RSP: 002b:00007fff40939718 EFLAGS: 00000246 ORIG_RAX: 00000000000001aa
 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f2eaf8f8289
 RDX: 0000000000000000 RSI: 0000000000006f71 RDI: 0000000000000004
 RBP: 00007fff409397a0 R08: 0000000000000000 R09: 0000000000000039
 R10: 0000000000000000 R11: 0000000000000246 R12: 00000000004006d0
 R13: 00007fff40939880 R14: 0000000000000000 R15: 0000000000000000
  </TASK>
 Kernel panic - not syncing: panic_on_warn set ...

We don't have a NULL check on file_ptr in io_msg_send_fd() function,
so when file_ptr is NUL src_file is also NULL and get_file()
dereferences a NULL pointer and leads to above crash.

Add a NULL check to fix this issue.

Fixes: e6130eba8a ("io_uring: add support for passing fixed file descriptors")
Reported-by: syzkaller <syzkaller@googlegroups.com>
Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
Link: https://lore.kernel.org/r/20221019171218.1337614-1-harshit.m.mogalapalli@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-19 12:33:33 -07:00
Jens Axboe 5c61795ea9 io_uring/rw: remove leftover debug statement
This debug statement was never meant to go into the upstream release,
kill it off before it ends up in a release. It was just part of the
testing for the initial version of the patch.

Fixes: 2ec33a6c3c ("io_uring/rw: ensure kiocb_end_write() is always called")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-16 17:24:10 -06:00
Pavel Begunkov 02bac94bd8 io_uring: don't iopoll from io_ring_ctx_wait_and_kill()
We should not be completing requests from a task context that has already
undergone io_uring cancellations, i.e. __io_uring_cancel(), as there are
some assumptions, e.g. around cached task refs draining. Remove
iopolling from io_ring_ctx_wait_and_kill() as it can be called later
after PF_EXITING is set with the last task_work run.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7c03cc91455c4a1af49c6b9cbda4e57ea467aa11.1665891182.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-16 17:08:42 -06:00
Pavel Begunkov 34f0bc427e io_uring: reuse io_alloc_req()
Don't duplicate io_alloc_req() in io_req_caches_free() but reuse the
helper.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6005fc88274864a49fc3096c22d8bdd605cf8576.1665891182.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-16 17:08:42 -06:00
Pavel Begunkov 4d5059512d io_uring: kill hot path fixed file bitmap debug checks
We test file_table.bitmap in io_file_get_fixed() to check invariants,
don't do it, it's expensive and was showing up in profiles. No reports of
this triggering has come in. Move the check to the file clear instead,
which will still catch any wrong usage.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/cf77f2ded68d2e5b2bc7355784d969837d48e023.1665891182.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-16 17:07:53 -06:00
Pavel Begunkov 38eddb2c75 io_uring: remove FFS_SCM
THe lifetime of SCM'ed files is bound to ring_sock, which is destroyed
strictly after we're done with registered file tables. This means there
is no need for the FFS_SCM hack, which was not available on 32-bit builds
anyway.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/984226a1045adf42dc35d8bd7fb5a8bbfa472ce1.1665891182.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-16 17:07:12 -06:00
Jens Axboe 2ec33a6c3c io_uring/rw: ensure kiocb_end_write() is always called
A previous commit moved the notifications and end-write handling, but
it is now missing a few spots where we also want to call both of those.
Without that, we can potentially be missing file notifications, and
more importantly, have an imbalance in the super_block writers sem
accounting.

Fixes: b000145e99 ("io_uring/rw: defer fsnotify calls to task context")
Reported-by: Dave Chinner <david@fromorbit.com>
Link: https://lore.kernel.org/all/20221010050319.GC2703033@dread.disaster.area/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-12 16:30:56 -06:00
Pavel Begunkov 00927931cb io_uring: fix fdinfo sqe offsets calculation
Only with the big sqe feature they take 128 bytes per entry, but we
unconditionally advance by 128B. Fix it by using sq_shift.

Fixes: 3b8fdd1dc3 ("io_uring/fdinfo: fix sqe dumping for IORING_SETUP_SQE128")
Reported-and-tested-by: syzbot+e5198737e8a2d23d958c@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8b41287cb75d5efb8fcb5cccde845ddbbadd8372.1665449983.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-12 16:30:56 -06:00
Stefan Roesch c86416c6ff io_uring: local variable rw shadows outer variable in io_write
This fixes the shadowing of the outer variable rw in the function
io_write(). No issue is caused by this, but let's silence the shadowing
warning anyway.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Stefan Roesch <shr@devkernel.io>
Link: https://lore.kernel.org/r/20221010234330.244244-1-shr@devkernel.io
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-12 16:30:56 -06:00
Jens Axboe 11528491c6 io_uring/opdef: remove 'audit_skip' from SENDMSG_ZC
The msg variants of sending aren't audited separately, so we should not
be setting audit_skip for the zerocopy sendmsg variant either.

Fixes: 493108d95f ("io_uring/net: zerocopy sendmsg")
Reported-by: Paul Moore <paul@paul-moore.com>
Reviewed-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-12 16:30:56 -06:00
Pavel Begunkov 44f87745d5 io_uring: optimise locking for local tw with submit_wait
Running local task_work requires taking uring_lock, for submit + wait we
can try to run them right after submit while we still hold the lock and
save one lock/unlokc pair. The optimisation was implemented in the first
local tw patches but got dropped for simplicity.

Suggested-by: Dylan Yudaken <dylany@fb.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/281fc79d98b5d91fe4778c5137a17a2ab4693e5c.1665088876.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-12 16:30:56 -06:00
Pavel Begunkov fc86f9d3bb io_uring: remove redundant memory barrier in io_req_local_work_add
io_cqring_wake() needs a barrier for the waitqueue_active() check.
However, in the case of io_req_local_work_add(), we call llist_add()
first, which implies an atomic. Hence we can replace smb_mb() with
smp_mb__after_atomic().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/43983bc8bc507172adda7a0f00cab1aff09fd238.1665018309.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-12 16:30:56 -06:00
Jens Axboe 3fb1bd6881 io_uring/net: handle -EINPROGRESS correct for IORING_OP_CONNECT
We treat EINPROGRESS like EAGAIN, but if we're retrying post getting
EINPROGRESS, then we just need to check the socket for errors and
terminate the request.

This was exposed on a bluetooth connection request which ends up
taking a while and hitting EINPROGRESS, and yields a CQE result of
-EBADFD because we're retrying a connect on a socket that is now
connected.

Cc: stable@vger.kernel.org
Fixes: 87f80d623c ("io_uring: handle connect -EINPROGRESS like -EAGAIN")
Link: https://github.com/axboe/liburing/issues/671
Reported-by: Aidan Sun <aidansun05@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-12 16:30:56 -06:00
Pavel Begunkov b7a817752e io_uring: remove notif leftovers
Notifications were killed but there is a couple of fields and struct
declarations left, remove them.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8df8877d677be5a2b43afd936d600e60105ea960.1664849941.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-12 16:30:56 -06:00
Pavel Begunkov 42b6419d0a io_uring: correct pinned_vm accounting
->mm_account should be released only after we free all registered
buffers, otherwise __io_sqe_buffers_unregister() will see a NULL
->mm_account and skip locked_vm accounting.

Cc: <Stable@vger.kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6d798f65ed4ab8db3664c4d3397d4af16ca98846.1664849932.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-12 16:30:56 -06:00
Pavel Begunkov 0091bfc817 io_uring/af_unix: defer registered files gc to io_uring release
Instead of putting io_uring's registered files in unix_gc() we want it
to be done by io_uring itself. The trick here is to consider io_uring
registered files for cycle detection but not actually putting them down.
Because io_uring can't register other ring instances, this will remove
all refs to the ring file triggering the ->release path and clean up
with io_ring_ctx_free().

Cc: stable@vger.kernel.org
Fixes: 6b06314c47 ("io_uring: add file set registration")
Reported-and-tested-by: David Bouman <dbouman03@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
[axboe: add kerneldoc comment to skb, fold in skb leak fix]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-12 16:30:34 -06:00
Pavel Begunkov d7cce96c44 io_uring: limit registration w/ SINGLE_ISSUER
IORING_SETUP_SINGLE_ISSUER restricts what tasks can submit requests.
Extend it to registration as well, so non-owning task can't do
registrations. It's not necessary at the moment but might be useful in
the future.

Cc: <stable@vger.kernel.org> # 6.0
Fixes: 97bbdc06a4 ("io_uring: add IORING_SETUP_SINGLE_ISSUER")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f52a6a9c8a8990d4a831f73c0571e7406aac2bba.1664237592.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-07 12:25:30 -06:00
Dylan Yudaken 4add705e4e io_uring: remove io_register_submitter
this is no longer needed, as submitter_task is set at creation time.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Fixes: 97bbdc06a4 ("io_uring: add IORING_SETUP_SINGLE_ISSUER")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-07 12:25:30 -06:00
Dylan Yudaken 97c96e9fa3 io_uring: simplify __io_uring_add_tctx_node
Remove submitter parameter from __io_uring_add_tctx_node.

It was only called from one place, and we can do that logic in that one
place.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Fixes: 97bbdc06a4 ("io_uring: add IORING_SETUP_SINGLE_ISSUER")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-07 12:25:30 -06:00
Linus Torvalds 7c989b1da3 for-6.1/passthrough-2022-10-04
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmM8rp4QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpjTHD/9eeWwaG7oSSu5J1YzkKn+hptaDzZwreL98
 Mh8euiQScUVpvHGkNowBhjBZ5cIAAcYaH17rjW7dWu6A7tv/iqygWd/YvIbs1JOe
 STSD9yf0RV4dI0MG6Wu2w6YxObaLvE5BTRxqb/WuFWNTgsYf2HEp4PM9sTio71+H
 WwWdRvsIxsRxVYemds3vBxd+BcM8vm26EoUTSaCwRhfopaJwBNceCYIIrM7VHUNM
 5G6+DJkm3mB1a8nsdguYZQC/y8F/9P5Ch9CdxA12yOZEryr3wzsyRNGdm7oRmFGM
 bAkjFcddhwk5+SuTzGX6t4/Z3ODIjeCXbMBg4p7AShHws4Yx1trJePiqoNQ8xd5A
 PkMfxhQpBPlDFKLmwtObPLInyzMpp5P8KYMIZfyymKD/+XjmqAlR6TXbFUTihzBU
 lHSFhwG8ysT2cAVrFBMDJu4UPIThIHqfkkF/nTkHePTSArJ/k5rGV7v5sQpZ+jtY
 R0gvoNHTq2IvgKGEEbTgDjpwVcCn5ERVorZuGjVN2nMdLj35kXpo7YNgyYMaD5LJ
 9SOR5a8iQjjudAfdGyZCGzNaOecizVFjABozUYc1XJi/boNuFTsq4XCE/tCLTixc
 V4sElRpgrlXxNXkiVdbuWIPuYo4sDw5gqZQynpVNH5PkmX/NqmpWYVEWJ20o+pwg
 3ag39nZQVQ==
 =nwLk
 -----END PGP SIGNATURE-----

Merge tag 'for-6.1/passthrough-2022-10-04' of git://git.kernel.dk/linux

Pull passthrough updates from Jens Axboe:
 "With these changes, passthrough NVMe support over io_uring now
  performs at the same level as block device O_DIRECT, and in many cases
  6-8% better.

  This contains:

   - Add support for fixed buffers for passthrough (Anuj, Kanchan)

   - Enable batched allocations and freeing on passthrough, similarly to
     what we support on the normal storage path (me)

   - Fix from Geert fixing an issue with !CONFIG_IO_URING"

* tag 'for-6.1/passthrough-2022-10-04' of git://git.kernel.dk/linux:
  io_uring: Add missing inline to io_uring_cmd_import_fixed() dummy
  nvme: wire up fixed buffer support for nvme passthrough
  nvme: pass ubuffer as an integer
  block: extend functionality to map bvec iterator
  block: factor out blk_rq_map_bio_alloc helper
  block: rename bio_map_put to blk_mq_map_bio_put
  nvme: refactor nvme_alloc_request
  nvme: refactor nvme_add_user_metadata
  nvme: Use blk_rq_map_user_io helper
  scsi: Use blk_rq_map_user_io helper
  block: add blk_rq_map_user_io
  io_uring: introduce fixed buffer support for io_uring_cmd
  io_uring: add io_uring_cmd_import_fixed
  nvme: enable batched completions of passthrough IO
  nvme: split out metadata vs non metadata end_io uring_cmd completions
  block: allow end_io based requests in the completion batch handling
  block: change request end_io handler to pass back a return value
  block: enable batched allocation for blk_mq_alloc_request()
  block: kill deprecated BUG_ON() in the flush handling
2022-10-07 09:35:50 -07:00
Linus Torvalds 513389809e for-6.1/block-2022-10-03
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmM67XkQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpiHoD/9eN+6YnNRPu5+2zeGnnm1Nlwic6YMZeORr
 KFIeC0COMWoFhNBIPFkgAKT+0qIH+uGt5UsHSM3Y5La7wMR8yLxD4PAnvTZ/Ijtt
 yxVIOmonJoQ0OrQ2kTbvDXL/9OCUrzwXXyUIEPJnH0Ca1mxeNOgDHbE7VGF6DMul
 0D3pI8qs2WLnHlDi1V/8kH5qZ6WoAJSDcb8sTzOUVnyveZPNaZhGQJuHA2XAYMtg
 fqKMDJqgmNk6jdTMUgdF5B+rV64PQoCy28I7fXqGkEe+RE5TBy57vAa0XY84V8XR
 /a8CEuwMts2ypk1hIcJG8Vv8K6u5war9yPM5MTngKsoMpzNIlhrhaJQVyjKdcs+E
 Ixwzexu6xTYcrcq+mUARgeTh79FzTBM/uXEdbCG2G3S6HPd6UZWUJZGfxw/l0Aem
 V4xB7lj6SQaJDU1iJCYUaHcekNXhQAPvyVG+R2ED1SO3McTpTPIM1aeigxw6vj7u
 bH3Kfdr94Z8HNuoLuiS6YYfjNt2Shf4LEB6GxKJ9TYHtyhdOyO0H64jGHpygrWqN
 cSnkWPUqUUNpF7srKM0ZgbliCshvmyJc4aMOFd0gBY/kXf5J/j7IXvh8TFCi9rHH
 0KyZH3/3Zsu9geUn3ynznlr4FXU+BcqE6boaa/iWb9sN1m+Rvaahv8cSch/dh44a
 vQNj/iOBQA==
 =R05e
 -----END PGP SIGNATURE-----

Merge tag 'for-6.1/block-2022-10-03' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

 - NVMe pull requests via Christoph:
      - handle number of queue changes in the TCP and RDMA drivers
        (Daniel Wagner)
      - allow changing the number of queues in nvmet (Daniel Wagner)
      - also consider host_iface when checking ip options (Daniel
        Wagner)
      - don't map pages which can't come from HIGHMEM (Fabio M. De
        Francesco)
      - avoid unnecessary flush bios in nvmet (Guixin Liu)
      - shrink and better pack the nvme_iod structure (Keith Busch)
      - add comment for unaligned "fake" nqn (Linjun Bao)
      - print actual source IP address through sysfs "address" attr
        (Martin Belanger)
      - various cleanups (Jackie Liu, Wolfram Sang, Genjian Zhang)
      - handle effects after freeing the request (Keith Busch)
      - copy firmware_rev on each init (Keith Busch)
      - restrict management ioctls to admin (Keith Busch)
      - ensure subsystem reset is single threaded (Keith Busch)
      - report the actual number of tagset maps in nvme-pci (Keith
        Busch)
      - small fabrics authentication fixups (Christoph Hellwig)
      - add common code for tagset allocation and freeing (Christoph
        Hellwig)
      - stop using the request_queue in nvmet (Christoph Hellwig)
      - set min_align_mask before calculating max_hw_sectors (Rishabh
        Bhatnagar)
      - send a rediscover uevent when a persistent discovery controller
        reconnects (Sagi Grimberg)
      - misc nvmet-tcp fixes (Varun Prakash, zhenwei pi)

 - MD pull request via Song:
      - Various raid5 fix and clean up, by Logan Gunthorpe and David
        Sloan.
      - Raid10 performance optimization, by Yu Kuai.

 - sbitmap wakeup hang fixes (Hugh, Keith, Jan, Yu)

 - IO scheduler switching quisce fix (Keith)

 - s390/dasd block driver updates (Stefan)

 - support for recovery for the ublk driver (ZiyangZhang)

 - rnbd drivers fixes and updates (Guoqing, Santosh, ye, Christoph)

 - blk-mq and null_blk map fixes (Bart)

 - various bcache fixes (Coly, Jilin, Jules)

 - nbd signal hang fix (Shigeru)

 - block writeback throttling fix (Yu)

 - optimize the passthrough mapping handling (me)

 - prepare block cgroups to being gendisk based (Christoph)

 - get rid of an old PSI hack in the block layer, moving it to the
   callers instead where it belongs (Christoph)

 - blk-throttle fixes and cleanups (Yu)

 - misc fixes and cleanups (Liu Shixin, Liu Song, Miaohe, Pankaj,
   Ping-Xiang, Wolfram, Saurabh, Li Jinlin, Li Lei, Lin, Li zeming,
   Miaohe, Bart, Coly, Gaosheng

* tag 'for-6.1/block-2022-10-03' of git://git.kernel.dk/linux: (162 commits)
  sbitmap: fix lockup while swapping
  block: add rationale for not using blk_mq_plug() when applicable
  block: adapt blk_mq_plug() to not plug for writes that require a zone lock
  s390/dasd: use blk_mq_alloc_disk
  blk-cgroup: don't update the blkg lookup hint in blkg_conf_prep
  nvmet: don't look at the request_queue in nvmet_bdev_set_limits
  nvmet: don't look at the request_queue in nvmet_bdev_zone_mgmt_emulate_all
  blk-mq: use quiesced elevator switch when reinitializing queues
  block: replace blk_queue_nowait with bdev_nowait
  nvme: remove nvme_ctrl_init_connect_q
  nvme-loop: use the tagset alloc/free helpers
  nvme-loop: store the generic nvme_ctrl in set->driver_data
  nvme-loop: initialize sqsize later
  nvme-fc: use the tagset alloc/free helpers
  nvme-fc: store the generic nvme_ctrl in set->driver_data
  nvme-fc: keep ctrl->sqsize in sync with opts->queue_size
  nvme-rdma: use the tagset alloc/free helpers
  nvme-rdma: store the generic nvme_ctrl in set->driver_data
  nvme-tcp: use the tagset alloc/free helpers
  nvme-tcp: store the generic nvme_ctrl in set->driver_data
  ...
2022-10-07 09:19:14 -07:00
Linus Torvalds 0a78a376ef for-6.1/io_uring-2022-10-03
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmM67S0QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgppnPEACkBzilBLKwT9MWdUAITwyrMXsAa1R9gsR9
 Tb3Xs+mNO2meuycLAUh4LIbb28NNr7/S5rwWet5NRZ71hgv4Q/WA/0EemAGGXYqd
 +3MEBAWU3FBFkC/cJXCnT8F5yCXYRkT5n/hzCSNEpNKjQ5JnAhHDlWAjgzZRuD/A
 A+YJjoBVJJuI1wY4I5XCpeQXEmg/Wc1MgXfyHgWVtGKnYrrxibiCnBZnqbAMZNvD
 hGn1Vl02ooamGTFm/nW/OAt71DtqsjWUCVOHKmlZ+zBUjbUj6FMXmPVV7vCV9o2w
 PT4Dx3CTc2iXwa8KfEFNPvXBzy0Qfu8edweP/MvZHWHVZREpEAh4cG6GhwW8whD+
 5mPisqmRjZKe0BBS4k/wKN1RXEypSQoTU4EdljfbQPU/usn35lmjMmEXXgs3IhqM
 fcTdO5ZUOp+CGyzI0Bc7UtS8vilJbX9ynN8G80MUUAZzuQg39MH7lNQYSJSSvJfU
 OlvzmL3lhRLYM1s/KKiZzdDBoMvC7R4oHmzCveOjQTMIHf6WNyqKFlrWScq2wzpN
 oRxqt0xiVQ3PFMmFj6N08f145qtbASuF3sKv7dbU3QXTsCAos3wdTdX+PejYApEZ
 W3dr0TDjNBicLNVPiSj132p0ZRtdZvLGuGVkBD4GPQeH2NwswxMHQAfz8e2lqmA4
 9bWG6BM7Yw==
 =m9kX
 -----END PGP SIGNATURE-----

Merge tag 'for-6.1/io_uring-2022-10-03' of git://git.kernel.dk/linux

Pull io_uring updates from Jens Axboe:

 - Add supported for more directly managed task_work running.

   This is beneficial for real world applications that end up issuing
   lots of system calls as part of handling work. Normal task_work will
   always execute as we transition in and out of the kernel, even for
   "unrelated" system calls. It's more efficient to defer the handling
   of io_uring's deferred work until the application wants it to be run,
   generally in batches.

   As part of ongoing work to write an io_uring network backend for
   Thrift, this has been shown to greatly improve performance. (Dylan)

 - Add IOPOLL support for passthrough (Kanchan)

 - Improvements and fixes to the send zero-copy support (Pavel)

 - Partial IO handling fixes (Pavel)

 - CQE ordering fixes around CQ ring overflow (Pavel)

 - Support sendto() for non-zc as well (Pavel)

 - Support sendmsg for zerocopy (Pavel)

 - Networking iov_iter fix (Stefan)

 - Misc fixes and cleanups (Pavel, me)

* tag 'for-6.1/io_uring-2022-10-03' of git://git.kernel.dk/linux: (56 commits)
  io_uring/net: fix notif cqe reordering
  io_uring/net: don't update msg_name if not provided
  io_uring: don't gate task_work run on TIF_NOTIFY_SIGNAL
  io_uring/rw: defer fsnotify calls to task context
  io_uring/net: fix fast_iov assignment in io_setup_async_msg()
  io_uring/net: fix non-zc send with address
  io_uring/net: don't skip notifs for failed requests
  io_uring/rw: don't lose short results on io_setup_async_rw()
  io_uring/rw: fix unexpected link breakage
  io_uring/net: fix cleanup double free free_iov init
  io_uring: fix CQE reordering
  io_uring/net: fix UAF in io_sendrecv_fail()
  selftest/net: adjust io_uring sendzc notif handling
  io_uring: ensure local task_work marks task as running
  io_uring/net: zerocopy sendmsg
  io_uring/net: combine fail handlers
  io_uring/net: rename io_sendzc()
  io_uring/net: support non-zerocopy sendto
  io_uring/net: refactor io_setup_async_addr
  io_uring/net: don't lose partial send_zc on fail
  ...
2022-10-07 08:52:43 -07:00
Linus Torvalds 4c0ed7d8d6 whack-a-mole: constifying struct path *
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCYzxmRQAKCRBZ7Krx/gZQ
 6+/kAQD2xyf+i4zOYVBr1NB3qBbhVS1zrni1NbC/kT3dJPgTvwEA7z7eqwnrN4zg
 scKFP8a3yPoaQBfs4do5PolhuSr2ngA=
 =NBI+
 -----END PGP SIGNATURE-----

Merge tag 'pull-path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs constification updates from Al Viro:
 "whack-a-mole: constifying struct path *"

* tag 'pull-path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  ecryptfs: constify path
  spufs: constify path
  nd_jump_link(): constify path
  audit_init_parent(): constify path
  __io_setxattr(): constify path
  do_proc_readlink(): constify path
  overlayfs: constify path
  fs/notify: constify path
  may_linkat(): constify path
  do_sys_name_to_handle(): constify path
  ->getprocattr(): attribute name is const char *, TYVM...
2022-10-06 17:31:02 -07:00
Linus Torvalds a0debc4c7a io_uring-6.0-2022-09-29
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmM2W6oQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgproBD/9qkKpoPdXWr/dAsHUXMYclF6zG6ds3HKSi
 wK6Uo4T/HvEmTcHC5sQ51UhNGMSXgGW7d3ijkcG3Uz2jjpemCALIErT6nW4fGb8u
 p0WX3qn5TIgzvRJUD5/nnSMN5RMJ2U/HExx9fve2xI/92WosdgeyugazX62RgKNr
 qxB7UMf6hHpl/9Yb/bTe7g+jpRyRnSTACT0gz+L11dhxtU6Bu8SBg+PVgsybsxYn
 v/DaJPpbSKNzQ0/nSmMDL8GuNwbU6LFXQ/6wkGyT6uXhJAOo7+ae1nHu4e6REQXa
 3xJC9zmSHGihyi85Mn33vcMolUcaWV5Tcz5Sllr4euUpJSZgKpsSZPgI7X7ZFC7A
 2RGhn9Xbv/MlLPlxEL19dcRf8uutjXaRMIS5AfLU3xT0R/b3F4P2Bv0RTiEt71dT
 quicp4XCcZpMMezI9M0QckWZ9LSCnR3F3e5Ucw55VBwD+UfHKVIgnW1NhzKAO9s0
 5VDPFUo9Nwf8l0+qPBwd9AxJvrkzbi6YmyM4a2dVuC5dJesEvtSpMKn4o2taSdP4
 pTQoQYwt5PdOKkwC5L9UsiHttSGb4Dp0SQoShCYoI+oSlIHeQrdJhKgX1JF0fmrf
 55xWVDVCgLP6+YnziCty57Albmq04OS+okBUjpuswrZ68yVqFS5QzfztaErBtbIi
 ESj8eFCVrw==
 =ZmK2
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-6.0-2022-09-29' of git://git.kernel.dk/linux

Pull io_uring fixes from Jens Axboe:
 "Two fixes that should go into 6.0:

   - Tweak the single issuer logic to register the task at creation,
     rather than at first submit. SINGLE_ISSUER was added for 6.0, and
     after some discussion on this, we decided to make it a bit stricter
     while it's still possible to do so (Dylan).

   - Stefan from Samba had some doubts on the level triggered poll that
     was added for this release. Rather than attempt to mess around with
     it now, just do the quick one-liner to disable it for release and
     we have time to discuss and change it for 6.1 instead (me)"

* tag 'io_uring-6.0-2022-09-29' of git://git.kernel.dk/linux:
  io_uring/poll: disable level triggered poll
  io_uring: register single issuer task at creation
2022-09-30 09:28:39 -07:00
Anuj Gupta 9cda70f622 io_uring: introduce fixed buffer support for io_uring_cmd
Add IORING_URING_CMD_FIXED flag that is to be used for sending io_uring
command with previously registered buffers. User-space passes the buffer
index in sqe->buf_index, same as done in read/write variants that uses
fixed buffers.

Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20220930062749.152261-3-anuj20.g@samsung.com
[axboe: shuffle valid flags check before acting on it]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-30 07:50:59 -06:00
Anuj Gupta a9216fac3e io_uring: add io_uring_cmd_import_fixed
This is a new helper that callers can use to obtain a bvec iterator for
the previously mapped buffer. This is preparatory work to enable
fixed-buffer support for io_uring_cmd.

Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20220930062749.152261-2-anuj20.g@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-30 07:49:11 -06:00
Jens Axboe 5853a7b551 Merge branch 'for-6.1/io_uring' into for-6.1/passthrough
* for-6.1/io_uring: (56 commits)
  io_uring/net: fix notif cqe reordering
  io_uring/net: don't update msg_name if not provided
  io_uring: don't gate task_work run on TIF_NOTIFY_SIGNAL
  io_uring/rw: defer fsnotify calls to task context
  io_uring/net: fix fast_iov assignment in io_setup_async_msg()
  io_uring/net: fix non-zc send with address
  io_uring/net: don't skip notifs for failed requests
  io_uring/rw: don't lose short results on io_setup_async_rw()
  io_uring/rw: fix unexpected link breakage
  io_uring/net: fix cleanup double free free_iov init
  io_uring: fix CQE reordering
  io_uring/net: fix UAF in io_sendrecv_fail()
  selftest/net: adjust io_uring sendzc notif handling
  io_uring: ensure local task_work marks task as running
  io_uring/net: zerocopy sendmsg
  io_uring/net: combine fail handlers
  io_uring/net: rename io_sendzc()
  io_uring/net: support non-zerocopy sendto
  io_uring/net: refactor io_setup_async_addr
  io_uring/net: don't lose partial send_zc on fail
  ...
2022-09-30 07:47:42 -06:00
Jens Axboe 736feaa3a0 Merge branch 'for-6.1/block' into for-6.1/passthrough
* for-6.1/block: (162 commits)
  sbitmap: fix lockup while swapping
  block: add rationale for not using blk_mq_plug() when applicable
  block: adapt blk_mq_plug() to not plug for writes that require a zone lock
  s390/dasd: use blk_mq_alloc_disk
  blk-cgroup: don't update the blkg lookup hint in blkg_conf_prep
  nvmet: don't look at the request_queue in nvmet_bdev_set_limits
  nvmet: don't look at the request_queue in nvmet_bdev_zone_mgmt_emulate_all
  blk-mq: use quiesced elevator switch when reinitializing queues
  block: replace blk_queue_nowait with bdev_nowait
  nvme: remove nvme_ctrl_init_connect_q
  nvme-loop: use the tagset alloc/free helpers
  nvme-loop: store the generic nvme_ctrl in set->driver_data
  nvme-loop: initialize sqsize later
  nvme-fc: use the tagset alloc/free helpers
  nvme-fc: store the generic nvme_ctrl in set->driver_data
  nvme-fc: keep ctrl->sqsize in sync with opts->queue_size
  nvme-rdma: use the tagset alloc/free helpers
  nvme-rdma: store the generic nvme_ctrl in set->driver_data
  nvme-tcp: use the tagset alloc/free helpers
  nvme-tcp: store the generic nvme_ctrl in set->driver_data
  ...

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-30 07:47:38 -06:00
Pavel Begunkov 108893ddcc io_uring/net: fix notif cqe reordering
send zc is not restricted to !IO_URING_F_UNLOCKED anymore and so
we can't use task-tw ordering trick to order notification cqes
with requests completions. In this case leave it alone and let
io_send_zc_cleanup() flush it.

Cc: stable@vger.kernel.org
Fixes: 53bdc88aac ("io_uring/notif: order notif vs send CQEs")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0031f3a00d492e814a4a0935a2029a46d9c9ba06.1664486545.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-29 17:46:04 -06:00
Pavel Begunkov 6f10ae8a15 io_uring/net: don't update msg_name if not provided
io_sendmsg_copy_hdr() may clear msg->msg_name if the userspace didn't
provide it, we should retain NULL in this case.

Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/97d49f61b5ec76d0900df658cfde3aa59ff22121.1664486545.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-29 17:46:04 -06:00
Jens Axboe 46a525e199 io_uring: don't gate task_work run on TIF_NOTIFY_SIGNAL
This isn't a reliable mechanism to tell if we have task_work pending, we
really should be looking at whether we have any items queued. This is
problematic if forward progress is gated on running said task_work. One
such example is reading from a pipe, where the write side has been closed
right before the read is started. The fput() of the file queues TWA_RESUME
task_work, and we need that task_work to be run before ->release() is
called for the pipe. If ->release() isn't called, then the read will sit
forever waiting on data that will never arise.

Fix this by io_run_task_work() so it checks if we have task_work pending
rather than rely on TIF_NOTIFY_SIGNAL for that. The latter obviously
doesn't work for task_work that is queued without TWA_SIGNAL.

Reported-by: Christiano Haesbaert <haesbaert@haesbaert.org>
Cc: stable@vger.kernel.org
Link: https://github.com/axboe/liburing/issues/665
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-29 16:07:45 -06:00
Jens Axboe b000145e99 io_uring/rw: defer fsnotify calls to task context
We can't call these off the kiocb completion as that might be off
soft/hard irq context. Defer the calls to when we process the
task_work for this request. That avoids valid complaints like:

stack backtrace:
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 6.0.0-rc6-syzkaller-00321-g105a36f3694e #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/26/2022
Call Trace:
 <IRQ>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
 print_usage_bug kernel/locking/lockdep.c:3961 [inline]
 valid_state kernel/locking/lockdep.c:3973 [inline]
 mark_lock_irq kernel/locking/lockdep.c:4176 [inline]
 mark_lock.part.0.cold+0x18/0xd8 kernel/locking/lockdep.c:4632
 mark_lock kernel/locking/lockdep.c:4596 [inline]
 mark_usage kernel/locking/lockdep.c:4527 [inline]
 __lock_acquire+0x11d9/0x56d0 kernel/locking/lockdep.c:5007
 lock_acquire kernel/locking/lockdep.c:5666 [inline]
 lock_acquire+0x1ab/0x570 kernel/locking/lockdep.c:5631
 __fs_reclaim_acquire mm/page_alloc.c:4674 [inline]
 fs_reclaim_acquire+0x115/0x160 mm/page_alloc.c:4688
 might_alloc include/linux/sched/mm.h:271 [inline]
 slab_pre_alloc_hook mm/slab.h:700 [inline]
 slab_alloc mm/slab.c:3278 [inline]
 __kmem_cache_alloc_lru mm/slab.c:3471 [inline]
 kmem_cache_alloc+0x39/0x520 mm/slab.c:3491
 fanotify_alloc_fid_event fs/notify/fanotify/fanotify.c:580 [inline]
 fanotify_alloc_event fs/notify/fanotify/fanotify.c:813 [inline]
 fanotify_handle_event+0x1130/0x3f40 fs/notify/fanotify/fanotify.c:948
 send_to_group fs/notify/fsnotify.c:360 [inline]
 fsnotify+0xafb/0x1680 fs/notify/fsnotify.c:570
 __fsnotify_parent+0x62f/0xa60 fs/notify/fsnotify.c:230
 fsnotify_parent include/linux/fsnotify.h:77 [inline]
 fsnotify_file include/linux/fsnotify.h:99 [inline]
 fsnotify_access include/linux/fsnotify.h:309 [inline]
 __io_complete_rw_common+0x485/0x720 io_uring/rw.c:195
 io_complete_rw+0x1a/0x1f0 io_uring/rw.c:228
 iomap_dio_complete_work fs/iomap/direct-io.c:144 [inline]
 iomap_dio_bio_end_io+0x438/0x5e0 fs/iomap/direct-io.c:178
 bio_endio+0x5f9/0x780 block/bio.c:1564
 req_bio_endio block/blk-mq.c:695 [inline]
 blk_update_request+0x3fc/0x1300 block/blk-mq.c:825
 scsi_end_request+0x7a/0x9a0 drivers/scsi/scsi_lib.c:541
 scsi_io_completion+0x173/0x1f70 drivers/scsi/scsi_lib.c:971
 scsi_complete+0x122/0x3b0 drivers/scsi/scsi_lib.c:1438
 blk_complete_reqs+0xad/0xe0 block/blk-mq.c:1022
 __do_softirq+0x1d3/0x9c6 kernel/softirq.c:571
 invoke_softirq kernel/softirq.c:445 [inline]
 __irq_exit_rcu+0x123/0x180 kernel/softirq.c:650
 irq_exit_rcu+0x5/0x20 kernel/softirq.c:662
 common_interrupt+0xa9/0xc0 arch/x86/kernel/irq.c:240

Fixes: f63cf5192f ("io_uring: ensure that fsnotify is always called")
Link: https://lore.kernel.org/all/20220929135627.ykivmdks2w5vzrwg@quack3/
Reported-by: syzbot+dfcc5f4da15868df7d4d@syzkaller.appspotmail.com
Reported-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-29 11:00:41 -06:00
Stefan Metzmacher 3e4cb6ebbb io_uring/net: fix fast_iov assignment in io_setup_async_msg()
I hit a very bad problem during my tests of SENDMSG_ZC.
BUG(); in first_iovec_segment() triggered very easily.
The problem was io_setup_async_msg() in the partial retry case,
which seems to happen more often with _ZC.

iov_iter_iovec_advance() may change i->iov in order to have i->iov_offset
being only relative to the first element.

Which means kmsg->msg.msg_iter.iov is no longer the
same as kmsg->fast_iov.

But this would rewind the copy to be the start of
async_msg->fast_iov, which means the internal
state of sync_msg->msg.msg_iter is inconsitent.

I tested with 5 vectors with length like this 4, 0, 64, 20, 8388608
and got a short writes with:
- ret=2675244 min_ret=8388692 => remaining 5713448 sr->done_io=2675244
- ret=-EAGAIN => io_uring_poll_arm
- ret=4911225 min_ret=5713448 => remaining 802223  sr->done_io=7586469
- ret=-EAGAIN => io_uring_poll_arm
- ret=802223  min_ret=802223  => res=8388692

While this was easily triggered with SENDMSG_ZC (queued for 6.1),
it was a potential problem starting with 7ba89d2af1
in 5.18 for IORING_OP_RECVMSG.
And also with 4c3c09439c in 5.19
for IORING_OP_SENDMSG.

However 257e84a537 introduced the critical
code into io_setup_async_msg() in 5.11.

Fixes: 7ba89d2af1 ("io_uring: ensure recv and recvmsg handle MSG_WAITALL correctly")
Fixes: 257e84a537 ("io_uring: refactor sendmsg/recvmsg iov managing")
Cc: stable@vger.kernel.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b2e7be246e2fb173520862b0c7098e55767567a2.1664436949.git.metze@samba.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-29 07:08:21 -06:00
Pavel Begunkov 04360d3e05 io_uring/net: fix non-zc send with address
We're currently ignoring the dest address with non-zerocopy send because
even though we copy it from the userspace shortly after ->msg_name gets
zeroed. Move msghdr init earlier.

Fixes: 516e82f0e0 ("io_uring/net: support non-zerocopy sendto")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/176ced5e8568aa5d300ca899b7f05b303ebc49fd.1664409532.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-28 19:29:12 -06:00
Jens Axboe d59bd748db io_uring/poll: disable level triggered poll
Stefan reports that there are issues with the level triggered
notification. Since we're late in the cycle, and it was introduced for
the 6.0 release, just disable it at prep time and we can bring this
back when Samba is happy with it.

Reported-by: Stefan Metzmacher <metze@samba.org>
Reviewed-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-28 19:27:11 -06:00
Pavel Begunkov 6ae91ac9a6 io_uring/net: don't skip notifs for failed requests
We currently only add a notification CQE when the send succeded, i.e.
cqe.res >= 0. However, it'd be more robust to do buffer notifications
for failed requests as well in case drivers decide do something fanky.

Always return a buffer notification after initial prep, don't hide it.
This behaviour is better aligned with documentation and the patch also
helps the userspace to respect it.

Cc: stable@vger.kernel.org # 6.0
Suggested-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9c8bead87b2b980fcec441b8faef52188b4a6588.1664292100.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-28 07:53:15 -06:00
Christoph Hellwig 568ec936bf block: replace blk_queue_nowait with bdev_nowait
Replace blk_queue_nowait with a bdev_nowait helpers that takes the
block_device given that the I/O submission path should not have to
look into the request_queue.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
Link: https://lore.kernel.org/r/20220927075815.269694-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-27 09:57:58 -06:00
Pavel Begunkov c278d9f8ac io_uring/rw: don't lose short results on io_setup_async_rw()
If a retry io_setup_async_rw() fails we lose result from the first
io_iter_do_read(), which is a problem mostly for streams/sockets.

Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0e8d20cebe5fc9c96ed268463c394237daabc384.1664235732.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-26 18:44:15 -06:00
Pavel Begunkov bf68b5b343 io_uring/rw: fix unexpected link breakage
req->cqe.res is set in io_read() to the amount of bytes left to be done,
which is used to figure out whether to fail a read or not. However,
io_read() may do another without returning, and we stash the previous
value into ->bytes_done but forget to update cqe.res. Then we ask a read
to do strictly less than cqe.res but expect the return to be exactly
cqe.res.

Fix the bug by updating cqe.res for retries.

Cc: stable@vger.kernel.org
Reported-and-Tested-by: Beld Zhang <beldzhang@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3a1088440c7be98e5800267af922a67da0ef9f13.1664235732.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-26 18:44:15 -06:00
Dylan Yudaken 7cae596bc3 io_uring: register single issuer task at creation
Instead of picking the task from the first submitter task, rather use the
creator task or in the case of disabled (IORING_SETUP_R_DISABLED) the
enabling task.

This approach allows a lot of simplification of the logic here. This
removes init logic from the submission path, which can always be a bit
confusing, but also removes the need for locking to write (or read) the
submitter_task.

Users that want to move a ring before submitting can create the ring
disabled and then enable it on the submitting task.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Fixes: 97bbdc06a4 ("io_uring: add IORING_SETUP_SINGLE_ISSUER")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-26 11:26:18 -06:00
Pavel Begunkov 4c17a496a7 io_uring/net: fix cleanup double free free_iov init
Having ->async_data doesn't mean it's initialised and previously we vere
relying on setting F_CLEANUP at the right moment. With zc sendmsg
though, we set F_CLEANUP early in prep when we alloc a notif and so we
may allocate async_data, fail in copy_msg_hdr() leaving
struct io_async_msghdr not initialised correctly but with F_CLEANUP
set, which causes a ->free_iov double free and probably other nastiness.

Always initialise ->free_iov. Also, now it might point to fast_iov when
fails, so avoid freeing it during cleanups.

Reported-by: syzbot+edfd15cd4246a3fc615a@syzkaller.appspotmail.com
Fixes: 493108d95f ("io_uring/net: zerocopy sendmsg")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-26 08:36:50 -06:00
Linus Torvalds 3db61221f4 io_uring-6.0-2022-09-23
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmMuVUEQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpvvDD/41UF2c5fTS7T4D4aRK4cR9UJGYe32obxee
 Li3tVvHHFfV32WfNhsGR8P6bEwNVJhngHO65eTKMHmydQ4WhUAzW4U77/dcCITRy
 c1jn9kuFTQkWJslkRs4laXS7siTSr42R650dwJZCHQk2wW96bY/QKgmoWLWFBkBw
 g7OY4O4mv/yebJq7mtBjsbOS/3zaOmf7CZIykBPlaaL6Y+7aKZ+NSjx2VUz/ZKHy
 od2wbHw6Bo7E5eNtZIM6brM4rJTATObcWR3bOoHOgxf/9Ah4pj+Tkfdm1pO1uAg6
 BEFOinAJ5naCB2u3Ycya4SvHgC6RgPGbb/u1H6AfGVPSUuMn6ggnoUGM8bg2UCKn
 U0Rr9C0g84tDA565JSQzHShA8YUwAb5X6OuG+cU04a1Nx3udRQOdGLdWCPa/gJUJ
 asBpqv8tOrpaS4ANMG+SmAWj7OIaDvEXUr2Gs3acz1FSWYPyXfVR0sUUsHuXToHX
 5JPfq5NQZBgoXKUXeYb3+FRgkp9DF+EkmrU1uGtU3718B8QAcpa10Vp+H+Vu2pnv
 59A/DjN9QjEH606KU94o1zkaC47/llkIm22BunrrQFcDKtcQUzO/Q9HO3ILm7nqH
 6oUIlCAU81rN55gH+M3pMMtm/dSroeCLxQU2tu4XaBYCsb+QfI1LXCDT/CO0Nb3I
 TzyXboIrvg==
 =F9sU
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-6.0-2022-09-23' of git://git.kernel.dk/linux

Pull io_uring fix from Jens Axboe:
 "Just a single fix for an issue with un-reaped IOPOLL requests on ring
  exit"

* tag 'io_uring-6.0-2022-09-23' of git://git.kernel.dk/linux:
  io_uring: ensure that cached task references are always put on exit
2022-09-24 08:27:08 -07:00
Jens Axboe e775f93f2a io_uring: ensure that cached task references are always put on exit
io_uring caches task references to avoid doing atomics for each of them
per request. If a request is put from the same task that allocated it,
then we can maintain a per-ctx cache of them. This obviously relies
on io_uring always pruning caches in a reliable way, and there's
currently a case off io_uring fd release where we can miss that.

One example is a ring setup with IOPOLL, which relies on the task
polling for completions, which will free them. However, if such a task
submits a request and then exits or closes the ring without reaping
the completion, then ring release will reap and put. If release happens
from that very same task, the completed request task refs will get
put back into the cache pool. This is problematic, as we're now beyond
the point of pruning caches.

Manually drop these caches after doing an IOPOLL reap. This releases
references from the current task, which is enough. If another task
happens to be doing the release, then the caching will not be
triggered and there's no issue.

Cc: stable@vger.kernel.org
Fixes: e98e49b2bb ("io_uring: extend task put optimisations")
Reported-by: Homin Rhee <hominlab@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-23 18:51:08 -06:00
Pavel Begunkov aa1df3a360 io_uring: fix CQE reordering
Overflowing CQEs may result in reordering, which is buggy in case of
links, F_MORE and so on. If we guarantee that we don't reorder for
the unlikely event of a CQ ring overflow, then we can further extend
this to not have to terminate multishot requests if it happens. For
other operations, like zerocopy sends, we have no choice but to honor
CQE ordering.

Reported-by: Dylan Yudaken <dylany@fb.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ec3bc55687b0768bbe20fb62d7d06cfced7d7e70.1663892031.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-23 15:04:20 -06:00
Pavel Begunkov a75155faef io_uring/net: fix UAF in io_sendrecv_fail()
We should not assume anything about ->free_iov just from
REQ_F_ASYNC_DATA but rather rely on REQ_F_NEED_CLEANUP, as we may
allocate ->async_data but failed init would leave the field in not
consistent state. The easiest solution is to remove removing
REQ_F_NEED_CLEANUP and so ->async_data dealloc from io_sendrecv_fail()
and let io_send_zc_cleanup() do the job. The catch here is that we also
need to prevent double notif flushing, just test it for NULL and zero
where it's needed.

BUG: KASAN: use-after-free in io_sendrecv_fail+0x3b0/0x3e0 io_uring/net.c:1221
Write of size 8 at addr ffff8880771b4080 by task syz-executor.3/30199

CPU: 1 PID: 30199 Comm: syz-executor.3 Not tainted 6.0.0-rc6-next-20220923-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/26/2022
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
 print_address_description mm/kasan/report.c:284 [inline]
 print_report+0x15e/0x45d mm/kasan/report.c:395
 kasan_report+0xbb/0x1f0 mm/kasan/report.c:495
 io_sendrecv_fail+0x3b0/0x3e0 io_uring/net.c:1221
 io_req_complete_failed+0x155/0x1b0 io_uring/io_uring.c:873
 io_drain_req io_uring/io_uring.c:1648 [inline]
 io_queue_sqe_fallback.cold+0x29f/0x788 io_uring/io_uring.c:1931
 io_submit_sqe io_uring/io_uring.c:2160 [inline]
 io_submit_sqes+0x1180/0x1df0 io_uring/io_uring.c:2276
 __do_sys_io_uring_enter+0xac6/0x2410 io_uring/io_uring.c:3216
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x63/0xcd

Fixes: c4c0009e0b ("io_uring/net: combine fail handlers")
Reported-by: syzbot+4c597a574a3f5a251bda@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/23ab8346e407ea50b1198a172c8a97e1cf22915b.1663945875.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-23 14:57:38 -06:00
Jens Axboe ec7fd2562f io_uring: ensure local task_work marks task as running
io_uring will run task_work from contexts that have been prepared for
waiting, and in doing so it'll implicitly set the task running again
to avoid issues with blocking conditions. The new deferred local
task_work doesn't do that, which can result in spews on this being
an invalid condition:



[  112.917576] do not call blocking ops when !TASK_RUNNING; state=1 set at [<00000000ad64af64>] prepare_to_wait_exclusive+0x3f/0xd0
[  112.983088] WARNING: CPU: 1 PID: 190 at kernel/sched/core.c:9819 __might_sleep+0x5a/0x60
[  112.987240] Modules linked in:
[  112.990504] CPU: 1 PID: 190 Comm: io_uring Not tainted 6.0.0-rc6+ #1617
[  113.053136] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
[  113.133650] RIP: 0010:__might_sleep+0x5a/0x60
[  113.136507] Code: ee 48 89 df 5b 31 d2 5d e9 33 ff ff ff 48 8b 90 30 0b 00 00 48 c7 c7 90 de 45 82 c6 05 20 8b 79 01 01 48 89 d1 e8 3a 49 77 00 <0f> 0b eb d1 66 90 0f 1f 44 00 00 9c 58 f6 c4 02 74 35 65 8b 05 ed
[  113.223940] RSP: 0018:ffffc90000537ca0 EFLAGS: 00010286
[  113.232903] RAX: 0000000000000000 RBX: ffffffff8246782c RCX: ffffffff8270bcc8
IOPS=133.15K, BW=520MiB/s, IOS/call=32/31
[  113.353457] RDX: ffffc90000537b50 RSI: 00000000ffffdfff RDI: 0000000000000001
[  113.358970] RBP: 00000000000003bc R08: 0000000000000000 R09: c0000000ffffdfff
[  113.361746] R10: 0000000000000001 R11: ffffc90000537b48 R12: ffff888103f97280
[  113.424038] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000001
[  113.428009] FS:  00007f67ae7fc700(0000) GS:ffff88842fc80000(0000) knlGS:0000000000000000
[  113.432794] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  113.503186] CR2: 00007f67b8b9b3b0 CR3: 0000000102b9b005 CR4: 0000000000770ee0
[  113.507291] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  113.512669] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  113.574374] PKRU: 55555554
[  113.576800] Call Trace:
[  113.578325]  <TASK>
[  113.579799]  set_page_dirty_lock+0x1b/0x90
[  113.582411]  __bio_release_pages+0x141/0x160
[  113.673078]  ? set_next_entity+0xd7/0x190
[  113.675632]  blk_rq_unmap_user+0xaa/0x210
[  113.678398]  ? timerqueue_del+0x2a/0x40
[  113.679578]  nvme_uring_task_cb+0x94/0xb0
[  113.683025]  __io_run_local_work+0x8a/0x150
[  113.743724]  ? io_cqring_wait+0x33d/0x500
[  113.746091]  io_run_local_work.part.76+0x2e/0x60
[  113.750091]  io_cqring_wait+0x2e7/0x500
[  113.752395]  ? trace_event_raw_event_io_uring_req_failed+0x180/0x180
[  113.823533]  __x64_sys_io_uring_enter+0x131/0x3c0
[  113.827382]  ? switch_fpu_return+0x49/0xc0
[  113.830753]  do_syscall_64+0x34/0x80
[  113.832620]  entry_SYSCALL_64_after_hwframe+0x5e/0xc8

Ensure that we mark current as TASK_RUNNING for deferred task_work
as well.

Fixes: c0e0d6ba25 ("io_uring: add IORING_SETUP_DEFER_TASKRUN")
Reported-by: Stefan Roesch <shr@fb.com>
Reviewed-by: Dylan Yudaken <dylany@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 19:39:35 -06:00
Pavel Begunkov 493108d95f io_uring/net: zerocopy sendmsg
Add a zerocopy version of sendmsg.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6aabc4bdfc0ec78df6ec9328137e394af9d4e7ef.1663668091.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 13:15:02 -06:00
Pavel Begunkov c4c0009e0b io_uring/net: combine fail handlers
Merge io_send_zc_fail() into io_sendrecv_fail(), saves a few lines of
code and some headache for following patch.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e0eba1d577413aef5602cd45f588b9230207082d.1663668091.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 13:15:02 -06:00
Pavel Begunkov b0e9b5517e io_uring/net: rename io_sendzc()
Simple renaming of io_sendzc*() functions in preparatio to adding
a zerocopy sendmsg variant.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/265af46829e6076dd220011b1858dc3151969226.1663668091.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 13:15:02 -06:00
Pavel Begunkov 516e82f0e0 io_uring/net: support non-zerocopy sendto
We have normal sends, but what is missing is sendto-like requests. Add
sendto() capabilities to IORING_OP_SEND by passing in addr just as we do
for IORING_OP_SEND_ZC.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/69fbd8b2cb830e57d1bf9ec351e9bf95c5b77e3f.1663668091.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 13:15:02 -06:00
Pavel Begunkov 6ae61b7aa2 io_uring/net: refactor io_setup_async_addr
Instead of passing the right address into io_setup_async_addr() only
specify local on-stack storage and let the function infer where to grab
it from. It optimises out one local variable we have to deal with.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6bfa9ab810d776853eb26ed59301e2536c3a5471.1663668091.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 13:15:02 -06:00
Pavel Begunkov 5693bcce89 io_uring/net: don't lose partial send_zc on fail
Partial zc send may end up in io_req_complete_failed(), which not only
would return invalid result but also mask out the notification leading
to lifetime issues.

Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/5673285b5e83e6ceca323727b4ddaa584b5cc91e.1663668091.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 13:15:02 -06:00
Pavel Begunkov 7e6b638ed5 io_uring/net: don't lose partial send/recv on fail
Just as with rw, partial send/recv may end up in
io_req_complete_failed() and loose the result, make sure we return the
number of bytes processed.

Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a4ff95897b5419356fca9ea55db91ac15b2975f9.1663668091.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 13:15:02 -06:00
Pavel Begunkov 47b4c68660 io_uring/rw: don't lose partial IO result on fail
A partially done read/write may end up in io_req_complete_failed() and
loose the result, make sure we return the number of bytes processed.

Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/05e0879c226bcd53b441bf92868eadd4bf04e2fc.1663668091.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 13:15:02 -06:00
Pavel Begunkov a47b255e90 io_uring: add custom opcode hooks on fail
Sometimes we have to do a little bit of a fixup on a request failuer in
io_req_complete_failed(). Add a callback in opdef for that.

Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b734cff4e67cb30cca976b9face321023f37549a.1663668091.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 13:15:02 -06:00
Jens Axboe 3b8fdd1dc3 io_uring/fdinfo: fix sqe dumping for IORING_SETUP_SQE128
If we have doubly sized SQEs, then we need to shift the sq index by 1
to account for using two entries for a single request. The CQE dumping
gets this right, but the SQE one does not.

Improve the SQE dumping in general, the information dumped is pretty
sparse and doesn't even cover the whole basic part of the SQE. Include
information on the extended part of the SQE, if doubly sized SQEs are
in use. A typical dump now looks like the following:

[...]
SQEs:	32
   32: opcode:URING_CMD, fd:0, flags:1, off:3225964160, addr:0x0, rw_flags:0x0, buf_index:0 user_data:2721, e0:0x0, e1:0xffffb8041000, e2:0x100000000000, e3:0x5500, e4:0x7, e5:0x0, e6:0x0, e7:0x0
   33: opcode:URING_CMD, fd:0, flags:1, off:3225964160, addr:0x0, rw_flags:0x0, buf_index:0 user_data:2722, e0:0x0, e1:0xffffb8043000, e2:0x100000000000, e3:0x5508, e4:0x7, e5:0x0, e6:0x0, e7:0x0
   34: opcode:URING_CMD, fd:0, flags:1, off:3225964160, addr:0x0, rw_flags:0x0, buf_index:0 user_data:2723, e0:0x0, e1:0xffffb8045000, e2:0x100000000000, e3:0x5510, e4:0x7, e5:0x0, e6:0x0, e7:0x0
[...]

Fixes: ebdeb7c01d ("io_uring: add support for 128-byte SQEs")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 13:15:02 -06:00
Jens Axboe 4f731705cc io_uring/fdinfo: get rid of unnecessary is_cqe32 variable
We already have the cq_shift, just use that to tell if we have doubly
sized CQEs or not.

While in there, cleanup the CQE32 vs normal CQE size printing.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 13:15:02 -06:00
Pavel Begunkov c0dc995eb2 io_uring: remove unused return from io_disarm_next
We removed conditional io_commit_cqring_flush() guarding against
spurious eventfd and the io_disarm_next()'s return value is not used
anymore, just void it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9a441c9a32a58bcc586076fa9a7d0dc33f1fb3cb.1662652536.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 13:15:01 -06:00
Pavel Begunkov 7924fdfeea io_uring: add fast path for io_run_local_work()
We'll grab uring_lock and call __io_run_local_work() with several
atomics inside even if there are no task works. Skip it if ->work_llist
is empty.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/f6a885f372bad2d77d9cd87341b0a86a4000c0ff.1662652536.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 13:15:01 -06:00
Pavel Begunkov 1f8d5bbe98 io_uring/iopoll: unify tw breaking logic
Let's keep checks for whether to break the iopoll loop or not same for
normal and defer tw, this includes ->cached_cq_tail checks guarding
against polling more than asked for.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d2fa8a44f8114f55a4807528da438cde93815360.1662652536.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 13:15:01 -06:00
Pavel Begunkov 9d54bd6a3b io_uring/iopoll: fix unexpected returns
We may propagate a positive return value of io_run_task_work() out of
io_iopoll_check(), which breaks our tests. io_run_task_work() doesn't
return anything useful for us, ignore the return value.

Fixes: c0e0d6ba25 ("io_uring: add IORING_SETUP_DEFER_TASKRUN")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/c442bb87f79cea10b3f857cbd4b9a4f0a0493fa3.1662652536.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 13:14:59 -06:00