Commit Graph

710 Commits

Author SHA1 Message Date
Jens Axboe e87fa62ebc io_uring/fdinfo: remove need for sqpoll lock for thread/pid retrieval
[ Upstream commit a0d45c3f59 ]

A previous commit added a trylock for getting the SQPOLL thread info via
fdinfo, but this introduced a regression where we often fail to get it if
the thread is busy. For that case, we end up not printing the current CPU
and PID info.

Rather than rely on this lock, just print the pid we already stored in
the io_sq_data struct, and ensure we update the current CPU every time
we've slept or potentially rescheduled. The latter won't potentially be
100% accurate, but that wasn't the case before either as the task can
get migrated at any time unless it has been pinned at creation time.

We retain keeping the io_sq_data dereference inside the ctx->uring_lock,
as it has always been, as destruction of the thread and data happen below
that. We could make this RCU safe, but there's little point in doing that.

With this, we always print the last valid information we had, rather than
have spurious outputs with missing information.

Fixes: 7644b1a1c9 ("io_uring/fdinfo: lock SQ thread while retrieving thread cpu/pid")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-28 17:14:56 +00:00
Jens Axboe 9e1c7fe853 io_uring/net: ensure socket is marked connected on connect retry
commit f8f9ab2d98 upstream.

io_uring does non-blocking connection attempts, which can yield some
unexpected results if a connect request is re-attempted by an an
application. This is equivalent to the following sync syscall sequence:

sock = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, IPPROTO_TCP);
connect(sock, &addr, sizeof(addr);

ret == -1 and errno == EINPROGRESS expected here. Now poll for POLLOUT
on sock, and when that returns, we expect the socket to be connected.
But if we follow that procedure with:

connect(sock, &addr, sizeof(addr));

you'd expect ret == -1 and errno == EISCONN here, but you actually get
ret == 0. If we attempt the connection one more time, then we get EISCON
as expected.

io_uring used to do this, but turns out that bluetooth fails with EBADFD
if you attempt to re-connect. Also looks like EISCONN _could_ occur with
this sequence.

Retain the ->in_progress logic, but work-around a potential EISCONN or
EBADFD error and only in those cases look at the sock_error(). This
should work in general and avoid the odd sequence of a repeated connect
request returning success when the socket is already connected.

This is all a side effect of the socket state being in a CONNECTING
state when we get EINPROGRESS, and only a re-connect or other related
operation will turn that into CONNECTED.

Cc: stable@vger.kernel.org
Fixes: 3fb1bd6881 ("io_uring/net: handle -EINPROGRESS correct for IORING_OP_CONNECT")
Link: https://github.com/axboe/liburing/issues/980
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-11-20 11:57:25 +01:00
Gabriel Krisman Bertazi cafff2cfb7 io_uring/kbuf: Allow the full buffer id space for provided buffers
[ Upstream commit f74c746e47 ]

nbufs tracks the number of buffers and not the last bgid. In 16-bit, we
have 2^16 valid buffers, but the check mistakenly rejects the last
bid. Let's fix it to make the interface consistent with the
documentation.

Fixes: ddf0322db7 ("io_uring: add IORING_OP_PROVIDE_BUFFERS")
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20231005000531.30800-3-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-20 11:57:01 +01:00
Gabriel Krisman Bertazi aabcc4c499 io_uring/kbuf: Fix check of BID wrapping in provided buffers
[ Upstream commit ab69838e7c ]

Commit 3851d25c75 ("io_uring: check for rollover of buffer ID when
providing buffers") introduced a check to prevent wrapping the BID
counter when sqe->off is provided, but it's off-by-one too
restrictive, rejecting the last possible BID (65534).

i.e., the following fails with -EINVAL.

     io_uring_prep_provide_buffers(sqe, addr, size, 0xFFFF, 0, 0);

Fixes: 3851d25c75 ("io_uring: check for rollover of buffer ID when providing buffers")
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20231005000531.30800-2-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-20 11:57:01 +01:00
Al Viro fd652af1e1 io_uring: kiocb_done() should *not* trust ->ki_pos if ->{read,write}_iter() failed
[ Upstream commit 1939316bf9 ]

->ki_pos value is unreliable in such cases.  For an obvious example,
consider O_DSYNC write - we feed the data to page cache and start IO,
then we make sure it's completed.  Update of ->ki_pos is dealt with
by the first part; failure in the second ends up with negative value
returned _and_ ->ki_pos left advanced as if sync had been successful.
In the same situation write(2) does not advance the file position
at all.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-08 14:09:03 +01:00
Jens Axboe 56e08ccc96 io_uring/fdinfo: lock SQ thread while retrieving thread cpu/pid
commit 7644b1a1c9 upstream.

We could race with SQ thread exit, and if we do, we'll hit a NULL pointer
dereference when the thread is cleared. Grab the SQPOLL data lock before
attempting to get the task cpu and pid for fdinfo, this ensures we have a
stable view of it.

Cc: stable@vger.kernel.org
Link: https://bugzilla.kernel.org/show_bug.cgi?id=218032
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-02 09:36:56 +01:00
Jeff Moyer 054dfb821c io-wq: fully initialize wqe before calling cpuhp_state_add_instance_nocalls()
[ Upstream commit 0f8baa3c98 ]

I received a bug report with the following signature:

[ 1759.937637] BUG: unable to handle page fault for address: ffffffffffffffe8
[ 1759.944564] #PF: supervisor read access in kernel mode
[ 1759.949732] #PF: error_code(0x0000) - not-present page
[ 1759.954901] PGD 7ab615067 P4D 7ab615067 PUD 7ab617067 PMD 0
[ 1759.960596] Oops: 0000 1 PREEMPT SMP PTI
[ 1759.964804] CPU: 15 PID: 109 Comm: cpuhp/15 Kdump: loaded Tainted: G X ------- — 5.14.0-362.3.1.el9_3.x86_64 #1
[ 1759.976609] Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380 Gen10, BIOS U30 06/20/2018
[ 1759.985181] RIP: 0010:io_wq_for_each_worker.isra.0+0x24/0xa0
[ 1759.990877] Code: 90 90 90 90 90 90 0f 1f 44 00 00 41 56 41 55 41 54 55 48 8d 6f 78 53 48 8b 47 78 48 39 c5 74 4f 49 89 f5 49 89 d4 48 8d 58 e8 <8b> 13 85 d2 74 32 8d 4a 01 89 d0 f0 0f b1 0b 75 5c 09 ca 78 3d 48
[ 1760.009758] RSP: 0000:ffffb6f403603e20 EFLAGS: 00010286
[ 1760.015013] RAX: 0000000000000000 RBX: ffffffffffffffe8 RCX: 0000000000000000
[ 1760.022188] RDX: ffffb6f403603e50 RSI: ffffffffb11e95b0 RDI: ffff9f73b09e9400
[ 1760.029362] RBP: ffff9f73b09e9478 R08: 000000000000000f R09: 0000000000000000
[ 1760.036536] R10: ffffffffffffff00 R11: ffffb6f403603d80 R12: ffffb6f403603e50
[ 1760.043712] R13: ffffffffb11e95b0 R14: ffffffffb28531e8 R15: ffff9f7a6fbdf548
[ 1760.050887] FS: 0000000000000000(0000) GS:ffff9f7a6fbc0000(0000) knlGS:0000000000000000
[ 1760.059025] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1760.064801] CR2: ffffffffffffffe8 CR3: 00000007ab610002 CR4: 00000000007706e0
[ 1760.071976] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1760.079150] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1760.086325] PKRU: 55555554
[ 1760.089044] Call Trace:
[ 1760.091501] <TASK>
[ 1760.093612] ? show_trace_log_lvl+0x1c4/0x2df
[ 1760.097995] ? show_trace_log_lvl+0x1c4/0x2df
[ 1760.102377] ? __io_wq_cpu_online+0x54/0xb0
[ 1760.106584] ? __die_body.cold+0x8/0xd
[ 1760.110356] ? page_fault_oops+0x134/0x170
[ 1760.114479] ? kernelmode_fixup_or_oops+0x84/0x110
[ 1760.119298] ? exc_page_fault+0xa8/0x150
[ 1760.123247] ? asm_exc_page_fault+0x22/0x30
[ 1760.127458] ? __pfx_io_wq_worker_affinity+0x10/0x10
[ 1760.132453] ? __pfx_io_wq_worker_affinity+0x10/0x10
[ 1760.137446] ? io_wq_for_each_worker.isra.0+0x24/0xa0
[ 1760.142527] __io_wq_cpu_online+0x54/0xb0
[ 1760.146558] cpuhp_invoke_callback+0x109/0x460
[ 1760.151029] ? __pfx_io_wq_cpu_offline+0x10/0x10
[ 1760.155673] ? __pfx_smpboot_thread_fn+0x10/0x10
[ 1760.160320] cpuhp_thread_fun+0x8d/0x140
[ 1760.164266] smpboot_thread_fn+0xd3/0x1a0
[ 1760.168297] kthread+0xdd/0x100
[ 1760.171457] ? __pfx_kthread+0x10/0x10
[ 1760.175225] ret_from_fork+0x29/0x50
[ 1760.178826] </TASK>
[ 1760.181022] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs rfkill sunrpc vfat fat dm_multipath intel_rapl_msr intel_rapl_common isst_if_common ipmi_ssif nfit libnvdimm mgag200 i2c_algo_bit ioatdma drm_shmem_helper drm_kms_helper acpi_ipmi syscopyarea x86_pkg_temp_thermal sysfillrect ipmi_si intel_powerclamp sysimgblt ipmi_devintf coretemp acpi_power_meter ipmi_msghandler rapl pcspkr dca intel_pch_thermal intel_cstate ses lpc_ich intel_uncore enclosure hpilo mei_me mei acpi_tad fuse drm xfs sd_mod sg bnx2x nvme nvme_core crct10dif_pclmul crc32_pclmul nvme_common ghash_clmulni_intel smartpqi tg3 t10_pi mdio uas libcrc32c crc32c_intel scsi_transport_sas usb_storage hpwdt wmi dm_mirror dm_region_hash dm_log dm_mod
[ 1760.248623] CR2: ffffffffffffffe8

A cpu hotplug callback was issued before wq->all_list was initialized.
This results in a null pointer dereference.  The fix is to fully setup
the io_wq before calling cpuhp_state_add_instance_nocalls().

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Link: https://lore.kernel.org/r/x49y1ghnecs.fsf@segfault.boston.devel.redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-25 12:16:20 +02:00
Jens Axboe 7053f66e94 io_uring: fix crash with IORING_SETUP_NO_MMAP and invalid SQ ring address
commit 8b51a3956d upstream.

If we specify a valid CQ ring address but an invalid SQ ring address,
we'll correctly spot this and free the allocated pages and clear them
to NULL. However, we don't clear the ring page count, and hence will
attempt to free the pages again. We've already cleared the address of
the page array when freeing them, but we don't check for that. This
causes the following crash:

Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
Oops [#1]
Modules linked in:
CPU: 0 PID: 20 Comm: kworker/u2:1 Not tainted 6.6.0-rc5-dirty #56
Hardware name: ucbbar,riscvemu-bare (DT)
Workqueue: events_unbound io_ring_exit_work
epc : io_pages_free+0x2a/0x58
 ra : io_rings_free+0x3a/0x50
 epc : ffffffff808811a2 ra : ffffffff80881406 sp : ffff8f80000c3cd0
 status: 0000000200000121 badaddr: 0000000000000000 cause: 000000000000000d
 [<ffffffff808811a2>] io_pages_free+0x2a/0x58
 [<ffffffff80881406>] io_rings_free+0x3a/0x50
 [<ffffffff80882176>] io_ring_exit_work+0x37e/0x424
 [<ffffffff80027234>] process_one_work+0x10c/0x1f4
 [<ffffffff8002756e>] worker_thread+0x252/0x31c
 [<ffffffff8002f5e4>] kthread+0xc4/0xe0
 [<ffffffff8000332a>] ret_from_fork+0xa/0x1c

Check for a NULL array in io_pages_free(), but also clear the page counts
when we free them to be on the safer side.

Reported-by: rtm@csail.mit.edu
Fixes: 03d89a2de2 ("io_uring: support for user allocated memory for rings/sqes")
Cc: stable@vger.kernel.org
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-25 12:16:12 +02:00
Jens Axboe 79a487128c io_uring: ensure io_lockdep_assert_cq_locked() handles disabled rings
commit 1658633c04 upstream.

io_lockdep_assert_cq_locked() checks that locking is correctly done when
a CQE is posted. If the ring is setup in a disabled state with
IORING_SETUP_R_DISABLED, then ctx->submitter_task isn't assigned until
the ring is later enabled. We generally don't post CQEs in this state,
as no SQEs can be submitted. However it is possible to generate a CQE
if tagged resources are being updated. If this happens and PROVE_LOCKING
is enabled, then the locking check helper will dereference
ctx->submitter_task, which hasn't been set yet.

Fixup io_lockdep_assert_cq_locked() to handle this case correctly. While
at it, convert it to a static inline as well, so that generated line
offsets will actually reflect which condition failed, rather than just
the line offset for io_lockdep_assert_cq_locked() itself.

Reported-and-tested-by: syzbot+efc45d4e7ba6ab4ef1eb@syzkaller.appspotmail.com
Fixes: f26cc95935 ("io_uring: lockdep annotate CQ locking")
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-10 22:02:57 +02:00
Jens Axboe fa6a7e0d8b io_uring/kbuf: don't allow registered buffer rings on highmem pages
commit f8024f1f36 upstream.

syzbot reports that registering a mapped buffer ring on arm32 can
trigger an OOPS. Registered buffer rings have two modes, one of them
is the application passing in the memory that the buffer ring should
reside in. Once those pages are mapped, we use page_address() to get
a virtual address. This will obviously fail on highmem pages, which
aren't mapped.

Add a check if we have any highmem pages after mapping, and fail the
attempt to register a provided buffer ring if we do. This will return
the same error as kernels that don't support provided buffer rings to
begin with.

Link: https://lore.kernel.org/io-uring/000000000000af635c0606bcb889@google.com/
Fixes: c56e022c0a ("io_uring: add support for user mapped provided buffer ring")
Cc: stable@vger.kernel.org
Reported-by: syzbot+2113e61b8848fa7951d8@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-10 22:02:57 +02:00
Jens Axboe 3f3164ce63 io_uring: don't allow IORING_SETUP_NO_MMAP rings on highmem pages
commit 223ef47431 upstream.

On at least arm32, but presumably any arch with highmem, if the
application passes in memory that resides in highmem for the rings,
then we should fail that ring creation. We fail it with -EINVAL, which
is what kernels that don't support IORING_SETUP_NO_MMAP will do as well.

Cc: stable@vger.kernel.org
Fixes: 03d89a2de2 ("io_uring: support for user allocated memory for rings/sqes")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-10 22:02:56 +02:00
Jens Axboe a4f5f1e846 io_uring/fs: remove sqe->rw_flags checking from LINKAT
commit a52d4f6575 upstream.

This is unionized with the actual link flags, so they can of course be
set and they will be evaluated further down. If not we fail any LINKAT
that has to set option flags.

Fixes: cf30da90bc ("io_uring: add support for IORING_OP_LINKAT")
Cc: stable@vger.kernel.org
Reported-by: Thomas Leonard <talex5@gmail.com>
Link: https://github.com/axboe/liburing/issues/955
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-06 13:16:23 +02:00
Pavel Begunkov 880dff7fd8 io_uring/net: fix iter retargeting for selected buf
commit c21a8027ad upstream.

When using selected buffer feature, io_uring delays data iter setup
until later. If io_setup_async_msg() is called before that it might see
not correctly setup iterator. Pre-init nr_segs and judge from its state
whether we repointing.

Cc: stable@vger.kernel.org
Reported-by: syzbot+a4c6e5ef999b68b26ed1@syzkaller.appspotmail.com
Fixes: 0455d4ccec ("io_uring: add POLL_FIRST support for send/sendmsg and recv/recvmsg")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0000000000002770be06053c7757@google.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-09-23 11:14:34 +02:00
Jens Axboe f1cd525fbb io_uring: annotate the struct io_kiocb slab for appropriate user copy
[ Upstream commit b97f96e22f ]

When compiling the kernel with clang and having HARDENED_USERCOPY
enabled, the liburing openat2.t test case fails during request setup:

usercopy: Kernel memory overwrite attempt detected to SLUB object 'io_kiocb' (offset 24, size 24)!
------------[ cut here ]------------
kernel BUG at mm/usercopy.c:102!
invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
CPU: 3 PID: 413 Comm: openat2.t Tainted: G                 N 6.4.3-g6995e2de6891-dirty #19
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014
RIP: 0010:usercopy_abort+0x84/0x90
Code: ce 49 89 ce 48 c7 c3 68 48 98 82 48 0f 44 de 48 c7 c7 56 c6 94 82 4c 89 de 48 89 c1 41 52 41 56 53 e8 e0 51 c5 00 48 83 c4 18 <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 41 57 41 56
RSP: 0018:ffffc900016b3da0 EFLAGS: 00010296
RAX: 0000000000000062 RBX: ffffffff82984868 RCX: 4e9b661ac6275b00
RDX: ffff8881b90ec580 RSI: ffffffff82949a64 RDI: 00000000ffffffff
RBP: 0000000000000018 R08: 0000000000000000 R09: 0000000000000000
R10: ffffc900016b3c88 R11: ffffc900016b3c30 R12: 00007ffe549659e0
R13: ffff888119014000 R14: 0000000000000018 R15: 0000000000000018
FS:  00007f862e3ca680(0000) GS:ffff8881b90c0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005571483542a8 CR3: 0000000118c11000 CR4: 00000000003506e0
Call Trace:
 <TASK>
 ? __die_body+0x63/0xb0
 ? die+0x9d/0xc0
 ? do_trap+0xa7/0x180
 ? usercopy_abort+0x84/0x90
 ? do_error_trap+0xc6/0x110
 ? usercopy_abort+0x84/0x90
 ? handle_invalid_op+0x2c/0x40
 ? usercopy_abort+0x84/0x90
 ? exc_invalid_op+0x2f/0x40
 ? asm_exc_invalid_op+0x16/0x20
 ? usercopy_abort+0x84/0x90
 __check_heap_object+0xe2/0x110
 __check_object_size+0x142/0x3d0
 io_openat2_prep+0x68/0x140
 io_submit_sqes+0x28a/0x680
 __se_sys_io_uring_enter+0x120/0x580
 do_syscall_64+0x3d/0x80
 entry_SYSCALL_64_after_hwframe+0x46/0xb0
RIP: 0033:0x55714834de26
Code: ca 01 0f b6 82 d0 00 00 00 8b ba cc 00 00 00 45 31 c0 31 d2 41 b9 08 00 00 00 83 e0 01 c1 e0 04 41 09 c2 b8 aa 01 00 00 0f 05 <c3> 66 0f 1f 84 00 00 00 00 00 89 30 eb 89 0f 1f 40 00 8b 00 a8 06
RSP: 002b:00007ffe549659c8 EFLAGS: 00000246 ORIG_RAX: 00000000000001aa
RAX: ffffffffffffffda RBX: 00007ffe54965a50 RCX: 000055714834de26
RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000003
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000008
R10: 0000000000000000 R11: 0000000000000246 R12: 000055714834f057
R13: 00007ffe54965a50 R14: 0000000000000001 R15: 0000557148351dd8
 </TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---

when it tries to copy struct open_how from userspace into the per-command
space in the io_kiocb. There's nothing wrong with the copy, but we're
missing the appropriate annotations for allowing user copies to/from the
io_kiocb slab.

Allow copies in the per-command area, which is from the 'file' pointer to
when 'opcode' starts. We do have existing user copies there, but they are
not all annotated like the one that openat2_prep() uses,
copy_struct_from_user(). But in practice opcodes should be allowed to
copy data into their per-command area in the io_kiocb.

Reported-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-23 11:14:25 +02:00
Gabriel Krisman Bertazi d25033932c io_uring: Don't set affinity on a dying sqpoll thread
commit bd6fc5da4c upstream.

Syzbot reported a null-ptr-deref of sqd->thread inside
io_sqpoll_wq_cpu_affinity.  It turns out the sqd->thread can go away
from under us during io_uring_register, in case the process gets a
fatal signal during io_uring_register.

It is not particularly hard to hit the race, and while I am not sure
this is the exact case hit by syzbot, it solves it.  Finally, checking
->thread is enough to close the race because we locked sqd while
"parking" the thread, thus preventing it from going away.

I reproduced it fairly consistently with a program that does:

int main(void) {
  ...
  io_uring_queue_init(RING_LEN, &ring1, IORING_SETUP_SQPOLL);
  while (1) {
    io_uring_register_iowq_aff(ring, 1, &mask);
  }
}

Executed in a loop with timeout to trigger SIGTERM:
  while true; do timeout 1 /a.out ; done

This will hit the following BUG() in very few attempts.

BUG: kernel NULL pointer dereference, address: 00000000000007a8
PGD 800000010e949067 P4D 800000010e949067 PUD 10e46e067 PMD 0
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 0 PID: 15715 Comm: dead-sqpoll Not tainted 6.5.0-rc7-next-20230825-g193296236fa0-dirty #23
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:io_sqpoll_wq_cpu_affinity+0x27/0x70
Code: 90 90 90 0f 1f 44 00 00 55 53 48 8b 9f 98 03 00 00 48 85 db 74 4f
48 89 df 48 89 f5 e8 e2 f8 ff ff 48 8b 43 38 48 85 c0 74 22 <48> 8b b8
a8 07 00 00 48 89 ee e8 ba b1 00 00 48 89 df 89 c5 e8 70
RSP: 0018:ffffb04040ea7e70 EFLAGS: 00010282
RAX: 0000000000000000 RBX: ffff93c010749e40 RCX: 0000000000000001
RDX: 0000000000000000 RSI: ffffffffa7653331 RDI: 00000000ffffffff
RBP: ffffb04040ea7eb8 R08: 0000000000000000 R09: c0000000ffffdfff
R10: ffff93c01141b600 R11: ffffb04040ea7d18 R12: ffff93c00ea74840
R13: 0000000000000011 R14: 0000000000000000 R15: ffff93c00ea74800
FS:  00007fb7c276ab80(0000) GS:ffff93c36f200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000007a8 CR3: 0000000111634003 CR4: 0000000000370ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 ? __die_body+0x1a/0x60
 ? page_fault_oops+0x154/0x440
 ? do_user_addr_fault+0x174/0x7b0
 ? exc_page_fault+0x63/0x140
 ? asm_exc_page_fault+0x22/0x30
 ? io_sqpoll_wq_cpu_affinity+0x27/0x70
 __io_register_iowq_aff+0x2b/0x60
 __io_uring_register+0x614/0xa70
 __x64_sys_io_uring_register+0xaa/0x1a0
 do_syscall_64+0x3a/0x90
 entry_SYSCALL_64_after_hwframe+0x6e/0xd8
RIP: 0033:0x7fb7c226fec9
Code: 2e 00 b8 ca 00 00 00 0f 05 eb a5 66 0f 1f 44 00 00 48 89 f8 48 89
f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01
f0 ff ff 73 01 c3 48 8b 0d 97 7f 2d 00 f7 d8 64 89 01 48
RSP: 002b:00007ffe2c0674f8 EFLAGS: 00000246 ORIG_RAX: 00000000000001ab
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb7c226fec9
RDX: 00007ffe2c067530 RSI: 0000000000000011 RDI: 0000000000000003
RBP: 00007ffe2c0675d0 R08: 00007ffe2c067550 R09: 00007ffe2c067550
R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000
R13: 00007ffe2c067750 R14: 0000000000000000 R15: 0000000000000000
 </TASK>
Modules linked in:
CR2: 00000000000007a8
---[ end trace 0000000000000000 ]---

Reported-by: syzbot+c74fea926a78b8a91042@syzkaller.appspotmail.com
Fixes: ebdfefc09c ("io_uring/sqpoll: fix io-wq affinity when IORING_SETUP_SQPOLL is used")
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/87v8cybuo6.fsf@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-09-13 09:53:52 +02:00
Pavel Begunkov f16c01edeb io_uring: break out of iowq iopoll on teardown
commit 45500dc4e0 upstream.

io-wq will retry iopoll even when it failed with -EAGAIN. If that
races with task exit, which sets TIF_NOTIFY_SIGNAL for all its workers,
such workers might potentially infinitely spin retrying iopoll again and
again and each time failing on some allocation / waiting / etc. Don't
keep spinning if io-wq is dying.

Fixes: 561fb04a6a ("io_uring: replace workqueue usage with io-wq")
Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-09-13 09:53:51 +02:00
Pavel Begunkov e88b673aed io_uring/net: don't overflow multishot accept
commit 1bfed23349 upstream.

Don't allow overflowing multishot accept CQEs, we want to limit
the grows of the overflow list.

Cc: stable@vger.kernel.org
Fixes: 4e86a2c980 ("io_uring: implement multishot mode for accept")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7d0d749649244873772623dd7747966f516fe6e2.1691757663.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-09-13 09:53:51 +02:00
Pavel Begunkov 1e2db9837b io_uring/net: don't overflow multishot recv
commit b2e74db55d upstream.

Don't allow overflowing multishot recv CQEs, it might get out of
hand, hurt performance, and in the worst case scenario OOM the task.

Cc: stable@vger.kernel.org
Fixes: b3fdea6ecb ("io_uring: multishot recv")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0b295634e8f1b71aa764c984608c22d85f88f75c.1691757663.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-09-13 09:53:51 +02:00
Jens Axboe 6aee0822a0 io_uring/sqpoll: fix io-wq affinity when IORING_SETUP_SQPOLL is used
commit ebdfefc09c upstream.

If we setup the ring with SQPOLL, then that polling thread has its
own io-wq setup. This means that if the application uses
IORING_REGISTER_IOWQ_AFF to set the io-wq affinity, we should not be
setting it for the invoking task, but rather the sqpoll task.

Add an sqpoll helper that parks the thread and updates the affinity,
and use that one if we're using SQPOLL.

Fixes: fe76421d1d ("io_uring: allow user configurable IO thread CPU affinity")
Cc: stable@vger.kernel.org # 5.10+
Link: https://github.com/axboe/liburing/discussions/884
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-09-13 09:53:51 +02:00
Pavel Begunkov c392179ed5 io_uring: break iopolling on signal
commit dc314886cb upstream.

Don't keep spinning iopoll with a signal set. It'll eventually return
back, e.g. by virtue of need_resched(), but it's not a nice user
experience.

Cc: stable@vger.kernel.org
Fixes: def596e955 ("io_uring: support for IO polling")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/eeba551e82cad12af30c3220125eb6cb244cc94c.1691594339.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-09-13 09:53:51 +02:00
Pavel Begunkov 0d2bb32800 io_uring: fix false positive KASAN warnings
commit 569f5308e5 upstream.

io_req_local_work_add() peeks into the work list, which can be executed
in the meanwhile. It's completely fine without KASAN as we're in an RCU
read section and it's SLAB_TYPESAFE_BY_RCU. With KASAN though it may
trigger a false positive warning because internal io_uring caches are
sanitised.

Remove sanitisation from the io_uring request cache for now.

Cc: stable@vger.kernel.org
Fixes: 8751d15426 ("io_uring: reduce scheduling due to tw")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c6fbf7a82a341e66a0007c76eefd9d57f2d3ba51.1691541473.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-09-13 09:53:51 +02:00
Pavel Begunkov aa149eddad io_uring: fix drain stalls by invalid SQE
[ Upstream commit cfdbaa3a29 ]

cq_extra is protected by ->completion_lock, which io_get_sqe() misses.
The bug is harmless as it doesn't happen in real life, requires invalid
SQ index array and racing with submission, and only messes up the
userspace, i.e. stall requests execution but will be cleaned up on
ring destruction.

Fixes: 15641e4270 ("io_uring: don't cache number of dropped SQEs")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/66096d54651b1a60534bb2023f2947f09f50ef73.1691538547.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-13 09:53:23 +02:00
Helge Deller 56675f8b9f io_uring/parisc: Adjust pgoff in io_uring mmap() for parisc
The changes from commit 32832a407a ("io_uring: Fix io_uring mmap() by
using architecture-provided get_unmapped_area()") to the parisc
implementation of get_unmapped_area() broke glibc's locale-gen
executable when running on parisc.

This patch reverts those architecture-specific changes, and instead
adjusts in io_uring_mmu_get_unmapped_area() the pgoff offset which is
then given to parisc's get_unmapped_area() function.  This is much
cleaner than the previous approach, and we still will get a coherent
addresss.

This patch has no effect on other architectures (SHM_COLOUR is only
defined on parisc), and the liburing testcase stil passes on parisc.

Cc: stable@vger.kernel.org # 6.4
Signed-off-by: Helge Deller <deller@gmx.de>
Reported-by: Christoph Biedl <linux-kernel.bfrz@manchmal.in-ulm.de>
Fixes: 32832a407a ("io_uring: Fix io_uring mmap() by using architecture-provided get_unmapped_area()")
Fixes: d808459b2e ("io_uring: Adjust mapping wrt architecture aliasing requirements")
Link: https://lore.kernel.org/r/ZNEyGV0jyI8kOOfz@p100
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-08 12:37:01 -06:00
Aleksa Sarai 72dbde0f2a io_uring: correct check for O_TMPFILE
O_TMPFILE is actually __O_TMPFILE|O_DIRECTORY. This means that the old
check for whether RESOLVE_CACHED can be used would incorrectly think
that O_DIRECTORY could not be used with RESOLVE_CACHED.

Cc: stable@vger.kernel.org # v5.12+
Fixes: 3a81fd0204 ("io_uring: enable LOOKUP_CACHED path resolution for filename lookups")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Link: https://lore.kernel.org/r/20230807-resolve_cached-o_tmpfile-v3-1-e49323e1ef6f@cyphar.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-07 12:34:23 -06:00
Linus Torvalds 9c65505826 io_uring-6.5-2023-07-28
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmTD2W8QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgprV3EADCRTtNrDWDbOFjAgfpfzzdRvumc4ANjFZ9
 h8xM9SIt9JBKvD6Lt8w825nYpBvGiK39qmnHssSgKz2Wna0gLM46ocKNQ6ilGWY1
 jP2rIazBpoEHPrR/yecpt4apWhv96c73hh25SISab2B1BBDYpI+WuUp5Hmwfe9Ya
 qAUSxwvlDtw5N4/tgJNNcu25nxWur3ACplMG5zUzp+ruVaJCHQlKsuXgTO5Ae8Nw
 rxkEUb3g+hPAcExofEgvPZjvQFr7k7jFexHC74XgZZznqrUZFnEtIXHwF+44iFb1
 /HgYZrcmQ8Wb2G3mD42fpuu5aMaQlghgJT13tEVZ8sQsqt965wIN5tkxp0CQHjPM
 P1bBEHa3g1I7U80CshKwNoawP0T8teRSKjHj4o62QvFKoH4VH58bImCVby5MwTs1
 eSB3DN3121rfwUY7gNzpdCnFIlpp3jSrSLMkMGmoV5iJuTRkIgpGJ3+uty5lesYs
 DmTkrzguRiuUK8NP5CUvsL7FPZmcTjWU4BXujbC4WCt0Uc4pXZUBIIUWB7AwOBTC
 FHy0W0A3H9pY5yyKyHfKUlDe7CnIgLXDstdznArtbn374R+j7R04dSWlntlg2bIk
 ZDl9wZ0MmQPMXCKXXk8Pc0EcamXt8Tp12Xb0VtF6ETPbDF0Xn6oSSkIIYda7z0WP
 PXjhTzIssQ==
 =+ctz
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-6.5-2023-07-28' of git://git.kernel.dk/linux

Pull io_uring fix from Jens Axboe:
 "Just a single tweak to a patch from last week, to avoid having idle
  cqring waits be attributed as iowait"

* tag 'io_uring-6.5-2023-07-28' of git://git.kernel.dk/linux:
  io_uring: gate iowait schedule on having pending requests
2023-07-28 10:19:44 -07:00
Jens Axboe 7b72d661f1 io_uring: gate iowait schedule on having pending requests
A previous commit made all cqring waits marked as iowait, as a way to
improve performance for short schedules with pending IO. However, for
use cases that have a special reaper thread that does nothing but
wait on events on the ring, this causes a cosmetic issue where we
know have one core marked as being "busy" with 100% iowait.

While this isn't a grave issue, it is confusing to users. Rather than
always mark us as being in iowait, gate setting of current->in_iowait
to 1 by whether or not the waiting task has pending requests.

Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/io-uring/CAMEGJJ2RxopfNQ7GNLhr7X9=bHXKo+G5OOe0LUq=+UgLXsv1Xg@mail.gmail.com/
Link: https://bugzilla.kernel.org/show_bug.cgi?id=217699
Link: https://bugzilla.kernel.org/show_bug.cgi?id=217700
Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Reported-by: Phil Elwell <phil@raspberrypi.com>
Tested-by: Andres Freund <andres@anarazel.de>
Fixes: 8a796565ce ("io_uring: Use io_schedule* in cqring wait")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-24 11:44:35 -06:00
Linus Torvalds bdd1d82e7d io_uring-6.5-2023-07-21
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmS62/kQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpuYYD/9HMf8IG/2i/EmfgHa2O0yyaBprITcMAnCx
 3x5szTz2Q4V5Dwdib1xRNAiT5KF4uoio8JJJWeIjeHTucUnVA7hjwAma4qmmVNlM
 9dVelH5FL85oNEMVlIqpJ2KHWHH3FaQAyvQvOIXODr+DNAOn+NlgL18B6W5g43Xt
 e8/d+vmDflr3KfliNOfADdZebJzB6hR497btScr9bcmVNuxHnz8tPUCugdLz5LQh
 GEGW+iIUIS/Xmfrv2cleO5ANQ245gcCyBBGV/3klj8LndJpeKstFp3dl8mTA372/
 ZLXnckKQerErKEcuvB8XJ3JIjlQR6s4JeCBk8fBi04ibHdRQuSDQgNXoFIKuRI2o
 A4znCK7P/TP7yK1dn7PLFaPtsfUkoUOSuTeMPkh/f8iW/ckurPu7WTQTd8dcyXDj
 50U09Aa1V7TpnTCGkfQGTlX5ihlvRHQhGdqUOnR7NuS/KOaFqxpK3X1fuMwoea+s
 I+BC1hDnSm5uJTzcTY/G4OBXyujlslTbD2bq4Gmpb9JkZVkzhvIpjYHiEmZQJrC2
 EvyI0z8MLARjHia6snn5EOJqrX7EsV7GmYcoWadQh/h1zt5TPyEyY0S4u6HrwtwB
 Rlp/dH60VbrRzoOHVzOr53F505F3DJ/woZXKGIuqTSBNtYudJKkfUbWY4Ga23pPw
 +i1JhigYgg==
 =FL//
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-6.5-2023-07-21' of git://git.kernel.dk/linux

Pull io_uring fixes from Jens Axboe:

 - Fix for io-wq not always honoring REQ_F_NOWAIT, if it was set and
   punted directly (eg via DRAIN) (me)

 - Capability check fix (Ondrej)

 - Regression fix for the mmap changes that went into 6.4, which
   apparently broke IA64 (Helge)

* tag 'io_uring-6.5-2023-07-21' of git://git.kernel.dk/linux:
  ia64: mmap: Consider pgoff when searching for free mapping
  io_uring: Fix io_uring mmap() by using architecture-provided get_unmapped_area()
  io_uring: treat -EAGAIN for REQ_F_NOWAIT as final for io-wq
  io_uring: don't audit the capability check in io_uring_create()
2023-07-22 10:46:30 -07:00
Helge Deller 32832a407a io_uring: Fix io_uring mmap() by using architecture-provided get_unmapped_area()
The io_uring testcase is broken on IA-64 since commit d808459b2e
("io_uring: Adjust mapping wrt architecture aliasing requirements").

The reason is, that this commit introduced an own architecture
independend get_unmapped_area() search algorithm which finds on IA-64 a
memory region which is outside of the regular memory region used for
shared userspace mappings and which can't be used on that platform
due to aliasing.

To avoid similar problems on IA-64 and other platforms in the future,
it's better to switch back to the architecture-provided
get_unmapped_area() function and adjust the needed input parameters
before the call. Beside fixing the issue, the function now becomes
easier to understand and maintain.

This patch has been successfully tested with the io_uring testcase on
physical x86-64, ppc64le, IA-64 and PA-RISC machines. On PA-RISC the LTP
mmmap testcases did not report any regressions.

Cc: stable@vger.kernel.org # 6.4
Signed-off-by: Helge Deller <deller@gmx.de>
Reported-by: matoro <matoro_mailinglist_kernel@matoro.tk>
Fixes: d808459b2e ("io_uring: Adjust mapping wrt architecture aliasing requirements")
Link: https://lore.kernel.org/r/20230721152432.196382-2-deller@gmx.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-21 09:41:29 -06:00
Jens Axboe a9be202269 io_uring: treat -EAGAIN for REQ_F_NOWAIT as final for io-wq
io-wq assumes that an issue is blocking, but it may not be if the
request type has asked for a non-blocking attempt. If we get
-EAGAIN for that case, then we need to treat it as a final result
and not retry or arm poll for it.

Cc: stable@vger.kernel.org # 5.10+
Link: https://github.com/axboe/liburing/issues/897
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-20 13:16:53 -06:00
Ondrej Mosnacek 6adc2272aa io_uring: don't audit the capability check in io_uring_create()
The check being unconditional may lead to unwanted denials reported by
LSMs when a process has the capability granted by DAC, but denied by an
LSM. In the case of SELinux such denials are a problem, since they can't
be effectively filtered out via the policy and when not silenced, they
produce noise that may hide a true problem or an attack.

Since not having the capability merely means that the created io_uring
context will be accounted against the current user's RLIMIT_MEMLOCK
limit, we can disable auditing of denials for this check by using
ns_capable_noaudit() instead of capable().

Fixes: 2b188cc1bb ("Add io_uring IO interface")
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2193317
Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Link: https://lore.kernel.org/r/20230718115607.65652-1-omosnace@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-18 14:16:25 -06:00
Linus Torvalds ec17f16432 io_uring-6.5-2023-07-14
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmSxpEEQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpvydEADdrmbazqc/wuwobCcdSXrl3q+Xp51hqtYs
 QdPhkig3WqDh3Jrl+9b0D+6+Ff5xvgxP/n+5og6Yae+asHqpXUerVV/9srFX8aj2
 jtWyYpHIHxf09+hOlkD65ZKMnceuofqleQ9dT/6TFSHzC36HLjdINANx4kvR7A/z
 L96gXXM6plQHLxA7FGMORma+48/EfvaRjAR6YEojYL76AoeZS9UEWi/F5ihjHMz3
 v4F9J5CiOfSIu8YJ+BfFeu9j7f0m9BHH1PJjyw21R9jeqLO+slrLbIBObKuaPqXe
 sLfkjAjIzUpBwstQcdWr6VUgfIiV5++aHBXqpH6X4v6Gt3roOSNjMQ++GdKZD+WC
 4VVsqOPGG5DzLNCz6V0FFfWMmjqeWvI3/O3ssGIvCkLcWaVO3m9eLvPrQ15WBQT2
 KJ+6Cs8VJlCvXd6pjhDFBEtAhzJGj8w+uzpuTPXiiz6/r3p5DQ3oE7cwXICR/TDj
 rrqckgtsQBItVtDNkgrhaCstx4lGFZhfspdPirpFUu6L6sqeMWWDFWceUiR4NBla
 Xij14t/5bv0cWnNaH9rzDTuV7fD5jgPHPc48gjSbnwTlHOXu5xt4xPd0RXnoVhDa
 zjsVL7nhZCb/YOw2nKOf6kNhIXRJ7hdWuXJBkYlCgEbAWQJsu6q6v4xaFzHsIQJa
 nR649rUo7w==
 =UPCD
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-6.5-2023-07-14' of git://git.kernel.dk/linux

Pull io_uring fix from Jens Axboe:
 "Just a single tweak for the wait logic in io_uring"

* tag 'io_uring-6.5-2023-07-14' of git://git.kernel.dk/linux:
  io_uring: Use io_schedule* in cqring wait
2023-07-14 19:46:54 -07:00
Andres Freund 8a796565ce io_uring: Use io_schedule* in cqring wait
I observed poor performance of io_uring compared to synchronous IO. That
turns out to be caused by deeper CPU idle states entered with io_uring,
due to io_uring using plain schedule(), whereas synchronous IO uses
io_schedule().

The losses due to this are substantial. On my cascade lake workstation,
t/io_uring from the fio repository e.g. yields regressions between 20%
and 40% with the following command:
./t/io_uring -r 5 -X0 -d 1 -s 1 -c 1 -p 0 -S$use_sync -R 0 /mnt/t2/fio/write.0.0

This is repeatable with different filesystems, using raw block devices
and using different block devices.

Use io_schedule_prepare() / io_schedule_finish() in
io_cqring_wait_schedule() to address the difference.

After that using io_uring is on par or surpassing synchronous IO (using
registered files etc makes it reliably win, but arguably is a less fair
comparison).

There are other calls to schedule() in io_uring/, but none immediately
jump out to be similarly situated, so I did not touch them. Similarly,
it's possible that mutex_lock_io() should be used, but it's not clear if
there are cases where that matters.

Cc: stable@vger.kernel.org # 5.10+
Cc: Pavel Begunkov <asml.silence@gmail.com>
Cc: io-uring@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Andres Freund <andres@anarazel.de>
Link: https://lore.kernel.org/r/20230707162007.194068-1-andres@anarazel.de
[axboe: minor style fixup]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-07 11:24:29 -06:00
Linus Torvalds 4f52875366 io_uring-6.5-2023-07-03
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmSjJ3MQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgph7DD/9GJGbifGaS54YGkTA5P5mSrSKnEKmDqg6Z
 4KqDlKRZUxf0e5Bi4i8Y72stf7RQo22O9ElibK0OK9DDcwzOOcvTTXTusWlTQ84t
 mufWwQECARoTESZzATYtEsmNLeS1ltVv17RuiN/OL5IkD5WzzCGkMnAYa6KHUn6q
 88YTQL6LN58Kol33Z2hab7tPxrIoigNDGo0Qvi2kWiO7ORERLAc3VyQ1MeV/Y5Sz
 jzdUGiHNkZYrh4YGKWAVqU9CnTvmmHAcmzs/UNK4m4PDp+OdofzZ3Q4VGq+Xb3CR
 y7camog7DfFd76ymlZppEF1JUDMeeKLDF43qqu9/nDjHZKL+jkG3vb7zIPOEC3fh
 s92CBAGyu49UiLzekfH4dWmYZ1i8coRS6RvGhxSWGxgs6a2C7nBF0+mbd1WQ2m/7
 yE7ojo5dZgz97bELRSqGJTS5zuavfYYJ5KDZ7kK8WaI1Iaen2AyMbFnfIqPfRjed
 5sAalpA5xD+76JjjY/6VaeNvocmxSACtprhAuC781sncjq2SLiGXiYm5XxhFDPlK
 Gen4F5Jc0qHQToXByoZFkPDxDBv1Itazx6x3nQ+QlcFxgrs0fByari8gLYk6LBtw
 tLp3hfCF307kt71+EgzHLCDthqHFWgCvFdLwnqT0FUdAPW+LrSqlk1jLI6BZD87z
 d0tAgIR0Ng==
 =/j96
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-6.5-2023-07-03' of git://git.kernel.dk/linux

Pull io_uring fixes from Jens Axboe:
 "The fix for the msghdr->msg_inq assigned value being wrong, using -1
  instead of -1U for the signed type.

  Also a fix for ensuring when we're trying to run task_work on an
  exiting task, that we wait for it. This is not really a correctness
  thing as the work is being canceled, but it does help with ensuring
  file descriptors are closed when the task has exited."

* tag 'io_uring-6.5-2023-07-03' of git://git.kernel.dk/linux:
  io_uring: flush offloaded and delayed task_work on exit
  io_uring: remove io_fallback_tw() forward declaration
  io_uring/net: use proper value for msg_inq
2023-07-03 18:43:10 -07:00
Linus Torvalds 18c9901d74 \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmScTY8ACgkQnJ2qBz9k
 QNnfwwgAhZtow1klDLH6qCnWMufB6AwT7VfAHyA3fvyTYMjUKb0sGHkpuh8hqVOb
 Lzb4YB+jSWV8XnMFn/4gFJQU/nAv8bMPavghMGpr5VNjQi7WkxYF/GB6O1I5NOHK
 EnJjDExgdxXDJZORaaXLVJWrtzJuDFgdiSeIwJECFa0MdTHNgPy3XOl+PPxnYQ/V
 xyHyP5ImGgd5O4iy3PFDQBGgOXIMrBX8IMce+qLQNYIvjSIUgmdnIkoUCvsQiisp
 LyKI2LxqAqnpA4h4Ow6hOZDw2VlPT0vDwFVUfFIZMIqs5YgaSbWa1Z6cs37MigAn
 fgUyRVx2y8A2Lwla7rwLaUEToRVADw==
 =ZdcG
 -----END PGP SIGNATURE-----

Merge tag 'fsnotify_for_v6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull fsnotify updates from Jan Kara:

 - Support for fanotify events returning file handles for filesystems
   not exportable via NFS

 - Improved error handling exportfs functions

 - Add missing FS_OPEN events when unusual open helpers are used

* tag 'fsnotify_for_v6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  fsnotify: move fsnotify_open() hook into do_dentry_open()
  exportfs: check for error return value from exportfs_encode_*()
  fanotify: support reporting non-decodeable file handles
  exportfs: allow exporting non-decodeable file handles to userspace
  exportfs: add explicit flag to request non-decodeable file handles
  exportfs: change connectable argument to bit flags
2023-06-29 13:31:44 -07:00
Linus Torvalds 3a8a670eee Networking changes for 6.5.
Core
 ----
 
  - Rework the sendpage & splice implementations. Instead of feeding
    data into sockets page by page extend sendmsg handlers to support
    taking a reference on the data, controlled by a new flag called
    MSG_SPLICE_PAGES. Rework the handling of unexpected-end-of-file
    to invoke an additional callback instead of trying to predict what
    the right combination of MORE/NOTLAST flags is.
    Remove the MSG_SENDPAGE_NOTLAST flag completely.
 
  - Implement SCM_PIDFD, a new type of CMSG type analogous to
    SCM_CREDENTIALS, but it contains pidfd instead of plain pid.
 
  - Enable socket busy polling with CONFIG_RT.
 
  - Improve reliability and efficiency of reporting for ref_tracker.
 
  - Auto-generate a user space C library for various Netlink families.
 
 Protocols
 ---------
 
  - Allow TCP to shrink the advertised window when necessary, prevent
    sk_rcvbuf auto-tuning from growing the window all the way up to
    tcp_rmem[2].
 
  - Use per-VMA locking for "page-flipping" TCP receive zerocopy.
 
  - Prepare TCP for device-to-device data transfers, by making sure
    that payloads are always attached to skbs as page frags.
 
  - Make the backoff time for the first N TCP SYN retransmissions
    linear. Exponential backoff is unnecessarily conservative.
 
  - Create a new MPTCP getsockopt to retrieve all info (MPTCP_FULL_INFO).
 
  - Avoid waking up applications using TLS sockets until we have
    a full record.
 
  - Allow using kernel memory for protocol ioctl callbacks, paving
    the way to issuing ioctls over io_uring.
 
  - Add nolocalbypass option to VxLAN, forcing packets to be fully
    encapsulated even if they are destined for a local IP address.
 
  - Make TCPv4 use consistent hash in TIME_WAIT and SYN_RECV. Ensure
    in-kernel ECMP implementation (e.g. Open vSwitch) select the same
    link for all packets. Support L4 symmetric hashing in Open vSwitch.
 
  - PPPoE: make number of hash bits configurable.
 
  - Allow DNS to be overwritten by DHCPACK in the in-kernel DHCP client
    (ipconfig).
 
  - Add layer 2 miss indication and filtering, allowing higher layers
    (e.g. ACL filters) to make forwarding decisions based on whether
    packet matched forwarding state in lower devices (bridge).
 
  - Support matching on Connectivity Fault Management (CFM) packets.
 
  - Hide the "link becomes ready" IPv6 messages by demoting their
    printk level to debug.
 
  - HSR: don't enable promiscuous mode if device offloads the proto.
 
  - Support active scanning in IEEE 802.15.4.
 
  - Continue work on Multi-Link Operation for WiFi 7.
 
 BPF
 ---
 
  - Add precision propagation for subprogs and callbacks. This allows
    maintaining verification efficiency when subprograms are used,
    or in fact passing the verifier at all for complex programs,
    especially those using open-coded iterators.
 
  - Improve BPF's {g,s}setsockopt() length handling. Previously BPF
    assumed the length is always equal to the amount of written data.
    But some protos allow passing a NULL buffer to discover what
    the output buffer *should* be, without writing anything.
 
  - Accept dynptr memory as memory arguments passed to helpers.
 
  - Add routing table ID to bpf_fib_lookup BPF helper.
 
  - Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands.
 
  - Drop bpf_capable() check in BPF_MAP_FREEZE command (used to mark
    maps as read-only).
 
  - Show target_{obj,btf}_id in tracing link fdinfo.
 
  - Addition of several new kfuncs (most of the names are self-explanatory):
    - Add a set of new dynptr kfuncs: bpf_dynptr_adjust(),
      bpf_dynptr_is_null(), bpf_dynptr_is_rdonly(), bpf_dynptr_size()
      and bpf_dynptr_clone().
    - bpf_task_under_cgroup()
    - bpf_sock_destroy() - force closing sockets
    - bpf_cpumask_first_and(), rework bpf_cpumask_any*() kfuncs
 
 Netfilter
 ---------
 
  - Relax set/map validation checks in nf_tables. Allow checking
    presence of an entry in a map without using the value.
 
  - Increase ip_vs_conn_tab_bits range for 64BIT builds.
 
  - Allow updating size of a set.
 
  - Improve NAT tuple selection when connection is closing.
 
 Driver API
 ----------
 
  - Integrate netdev with LED subsystem, to allow configuring HW
    "offloaded" blinking of LEDs based on link state and activity
    (i.e. packets coming in and out).
 
  - Support configuring rate selection pins of SFP modules.
 
  - Factor Clause 73 auto-negotiation code out of the drivers, provide
    common helper routines.
 
  - Add more fool-proof helpers for managing lifetime of MDIO devices
    associated with the PCS layer.
 
  - Allow drivers to report advanced statistics related to Time Aware
    scheduler offload (taprio).
 
  - Allow opting out of VF statistics in link dump, to allow more VFs
    to fit into the message.
 
  - Split devlink instance and devlink port operations.
 
 New hardware / drivers
 ----------------------
 
  - Ethernet:
    - Synopsys EMAC4 IP support (stmmac)
    - Marvell 88E6361 8 port (5x1GE + 3x2.5GE) switches
    - Marvell 88E6250 7 port switches
    - Microchip LAN8650/1 Rev.B0 PHYs
    - MediaTek MT7981/MT7988 built-in 1GE PHY driver
 
  - WiFi:
    - Realtek RTL8192FU, 2.4 GHz, b/g/n mode, 2T2R, 300 Mbps
    - Realtek RTL8723DS (SDIO variant)
    - Realtek RTL8851BE
 
  - CAN:
    - Fintek F81604
 
 Drivers
 -------
 
  - Ethernet NICs:
    - Intel (100G, ice):
      - support dynamic interrupt allocation
      - use meta data match instead of VF MAC addr on slow-path
    - nVidia/Mellanox:
      - extend link aggregation to handle 4, rather than just 2 ports
      - spawn sub-functions without any features by default
    - OcteonTX2:
      - support HTB (Tx scheduling/QoS) offload
      - make RSS hash generation configurable
      - support selecting Rx queue using TC filters
    - Wangxun (ngbe/txgbe):
      - add basic Tx/Rx packet offloads
      - add phylink support (SFP/PCS control)
    - Freescale/NXP (enetc):
      - report TAPRIO packet statistics
    - Solarflare/AMD:
      - support matching on IP ToS and UDP source port of outer header
      - VxLAN and GENEVE tunnel encapsulation over IPv4 or IPv6
      - add devlink dev info support for EF10
 
  - Virtual NICs:
    - Microsoft vNIC:
      - size the Rx indirection table based on requested configuration
      - support VLAN tagging
    - Amazon vNIC:
      - try to reuse Rx buffers if not fully consumed, useful for ARM
        servers running with 16kB pages
    - Google vNIC:
      - support TCP segmentation of >64kB frames
 
  - Ethernet embedded switches:
    - Marvell (mv88e6xxx):
      - enable USXGMII (88E6191X)
    - Microchip:
     - lan966x: add support for Egress Stage 0 ACL engine
     - lan966x: support mapping packet priority to internal switch
       priority (based on PCP or DSCP)
 
  - Ethernet PHYs:
    - Broadcom PHYs:
      - support for Wake-on-LAN for BCM54210E/B50212E
      - report LPI counter
    - Microsemi PHYs: support RGMII delay configuration (VSC85xx)
    - Micrel PHYs: receive timestamp in the frame (LAN8841)
    - Realtek PHYs: support optional external PHY clock
    - Altera TSE PCS: merge the driver into Lynx PCS which it is
      a variant of
 
  - CAN: Kvaser PCIEcan:
    - support packet timestamping
 
  - WiFi:
    - Intel (iwlwifi):
      - major update for new firmware and Multi-Link Operation (MLO)
      - configuration rework to drop test devices and split
        the different families
      - support for segmented PNVM images and power tables
      - new vendor entries for PPAG (platform antenna gain) feature
    - Qualcomm 802.11ax (ath11k):
      - Multiple Basic Service Set Identifier (MBSSID) and
        Enhanced MBSSID Advertisement (EMA) support in AP mode
      - support factory test mode
    - RealTek (rtw89):
      - add RSSI based antenna diversity
      - support U-NII-4 channels on 5 GHz band
    - RealTek (rtl8xxxu):
      - AP mode support for 8188f
      - support USB RX aggregation for the newer chips
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmSbJM4ACgkQMUZtbf5S
 IrtoDhAAhEim1+LBIKf4lhPcVdZ2p/TkpnwTz5jsTwSeRBAxTwuNJ2fQhFXg13E3
 MnRq6QaEp8G4/tA/gynLvQop+FEZEnv+horP0zf/XLcC8euU7UrKdrpt/4xxdP07
 IL/fFWsoUGNO+L9LNaHwBo8g7nHvOkPscHEBHc2Xrvzab56TJk6vPySfLqcpKlNZ
 CHWDwTpgRqNZzSKiSpoMVd9OVMKUXcPYHpDmfEJ5l+e8vTXmZzOLHrSELHU5nP5f
 mHV7gxkDCTshoGcaed7UTiOvgu1p6E5EchDJxiLaSUbgsd8SZ3u4oXwRxgj33RK/
 fB2+UaLrRt/DdlHvT/Ph8e8Ygu77yIXMjT49jsfur/zVA0HEA2dFb7V6QlsYRmQp
 J25pnrdXmE15llgqsC0/UOW5J1laTjII+T2T70UOAqQl4LWYAQDG4WwsAqTzU0KY
 dueydDouTp9XC2WYrRUEQxJUzxaOaazskDUHc5c8oHp/zVBT+djdgtvVR9+gi6+7
 yy4elI77FlEEqL0ItdU/lSWINayAlPLsIHkMyhSGKX0XDpKjeycPqkNx4UterXB/
 JKIR5RBWllRft+igIngIkKX0tJGMU0whngiw7d1WLw25wgu4sB53hiWWoSba14hv
 tXMxwZs5iGaPcT38oRVMZz8I1kJM4Dz3SyI7twVvi4RUut64EG4=
 =9i4I
 -----END PGP SIGNATURE-----

Merge tag 'net-next-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking changes from Jakub Kicinski:
 "WiFi 7 and sendpage changes are the biggest pieces of work for this
  release. The latter will definitely require fixes but I think that we
  got it to a reasonable point.

  Core:

   - Rework the sendpage & splice implementations

     Instead of feeding data into sockets page by page extend sendmsg
     handlers to support taking a reference on the data, controlled by a
     new flag called MSG_SPLICE_PAGES

     Rework the handling of unexpected-end-of-file to invoke an
     additional callback instead of trying to predict what the right
     combination of MORE/NOTLAST flags is

     Remove the MSG_SENDPAGE_NOTLAST flag completely

   - Implement SCM_PIDFD, a new type of CMSG type analogous to
     SCM_CREDENTIALS, but it contains pidfd instead of plain pid

   - Enable socket busy polling with CONFIG_RT

   - Improve reliability and efficiency of reporting for ref_tracker

   - Auto-generate a user space C library for various Netlink families

  Protocols:

   - Allow TCP to shrink the advertised window when necessary, prevent
     sk_rcvbuf auto-tuning from growing the window all the way up to
     tcp_rmem[2]

   - Use per-VMA locking for "page-flipping" TCP receive zerocopy

   - Prepare TCP for device-to-device data transfers, by making sure
     that payloads are always attached to skbs as page frags

   - Make the backoff time for the first N TCP SYN retransmissions
     linear. Exponential backoff is unnecessarily conservative

   - Create a new MPTCP getsockopt to retrieve all info
     (MPTCP_FULL_INFO)

   - Avoid waking up applications using TLS sockets until we have a full
     record

   - Allow using kernel memory for protocol ioctl callbacks, paving the
     way to issuing ioctls over io_uring

   - Add nolocalbypass option to VxLAN, forcing packets to be fully
     encapsulated even if they are destined for a local IP address

   - Make TCPv4 use consistent hash in TIME_WAIT and SYN_RECV. Ensure
     in-kernel ECMP implementation (e.g. Open vSwitch) select the same
     link for all packets. Support L4 symmetric hashing in Open vSwitch

   - PPPoE: make number of hash bits configurable

   - Allow DNS to be overwritten by DHCPACK in the in-kernel DHCP client
     (ipconfig)

   - Add layer 2 miss indication and filtering, allowing higher layers
     (e.g. ACL filters) to make forwarding decisions based on whether
     packet matched forwarding state in lower devices (bridge)

   - Support matching on Connectivity Fault Management (CFM) packets

   - Hide the "link becomes ready" IPv6 messages by demoting their
     printk level to debug

   - HSR: don't enable promiscuous mode if device offloads the proto

   - Support active scanning in IEEE 802.15.4

   - Continue work on Multi-Link Operation for WiFi 7

  BPF:

   - Add precision propagation for subprogs and callbacks. This allows
     maintaining verification efficiency when subprograms are used, or
     in fact passing the verifier at all for complex programs,
     especially those using open-coded iterators

   - Improve BPF's {g,s}setsockopt() length handling. Previously BPF
     assumed the length is always equal to the amount of written data.
     But some protos allow passing a NULL buffer to discover what the
     output buffer *should* be, without writing anything

   - Accept dynptr memory as memory arguments passed to helpers

   - Add routing table ID to bpf_fib_lookup BPF helper

   - Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands

   - Drop bpf_capable() check in BPF_MAP_FREEZE command (used to mark
     maps as read-only)

   - Show target_{obj,btf}_id in tracing link fdinfo

   - Addition of several new kfuncs (most of the names are
     self-explanatory):
      - Add a set of new dynptr kfuncs: bpf_dynptr_adjust(),
        bpf_dynptr_is_null(), bpf_dynptr_is_rdonly(), bpf_dynptr_size()
        and bpf_dynptr_clone().
      - bpf_task_under_cgroup()
      - bpf_sock_destroy() - force closing sockets
      - bpf_cpumask_first_and(), rework bpf_cpumask_any*() kfuncs

  Netfilter:

   - Relax set/map validation checks in nf_tables. Allow checking
     presence of an entry in a map without using the value

   - Increase ip_vs_conn_tab_bits range for 64BIT builds

   - Allow updating size of a set

   - Improve NAT tuple selection when connection is closing

  Driver API:

   - Integrate netdev with LED subsystem, to allow configuring HW
     "offloaded" blinking of LEDs based on link state and activity
     (i.e. packets coming in and out)

   - Support configuring rate selection pins of SFP modules

   - Factor Clause 73 auto-negotiation code out of the drivers, provide
     common helper routines

   - Add more fool-proof helpers for managing lifetime of MDIO devices
     associated with the PCS layer

   - Allow drivers to report advanced statistics related to Time Aware
     scheduler offload (taprio)

   - Allow opting out of VF statistics in link dump, to allow more VFs
     to fit into the message

   - Split devlink instance and devlink port operations

  New hardware / drivers:

   - Ethernet:
      - Synopsys EMAC4 IP support (stmmac)
      - Marvell 88E6361 8 port (5x1GE + 3x2.5GE) switches
      - Marvell 88E6250 7 port switches
      - Microchip LAN8650/1 Rev.B0 PHYs
      - MediaTek MT7981/MT7988 built-in 1GE PHY driver

   - WiFi:
      - Realtek RTL8192FU, 2.4 GHz, b/g/n mode, 2T2R, 300 Mbps
      - Realtek RTL8723DS (SDIO variant)
      - Realtek RTL8851BE

   - CAN:
      - Fintek F81604

  Drivers:

   - Ethernet NICs:
      - Intel (100G, ice):
         - support dynamic interrupt allocation
         - use meta data match instead of VF MAC addr on slow-path
      - nVidia/Mellanox:
         - extend link aggregation to handle 4, rather than just 2 ports
         - spawn sub-functions without any features by default
      - OcteonTX2:
         - support HTB (Tx scheduling/QoS) offload
         - make RSS hash generation configurable
         - support selecting Rx queue using TC filters
      - Wangxun (ngbe/txgbe):
         - add basic Tx/Rx packet offloads
         - add phylink support (SFP/PCS control)
      - Freescale/NXP (enetc):
         - report TAPRIO packet statistics
      - Solarflare/AMD:
         - support matching on IP ToS and UDP source port of outer
           header
         - VxLAN and GENEVE tunnel encapsulation over IPv4 or IPv6
         - add devlink dev info support for EF10

   - Virtual NICs:
      - Microsoft vNIC:
         - size the Rx indirection table based on requested
           configuration
         - support VLAN tagging
      - Amazon vNIC:
         - try to reuse Rx buffers if not fully consumed, useful for ARM
           servers running with 16kB pages
      - Google vNIC:
         - support TCP segmentation of >64kB frames

   - Ethernet embedded switches:
      - Marvell (mv88e6xxx):
         - enable USXGMII (88E6191X)
      - Microchip:
         - lan966x: add support for Egress Stage 0 ACL engine
         - lan966x: support mapping packet priority to internal switch
           priority (based on PCP or DSCP)

   - Ethernet PHYs:
      - Broadcom PHYs:
         - support for Wake-on-LAN for BCM54210E/B50212E
         - report LPI counter
      - Microsemi PHYs: support RGMII delay configuration (VSC85xx)
      - Micrel PHYs: receive timestamp in the frame (LAN8841)
      - Realtek PHYs: support optional external PHY clock
      - Altera TSE PCS: merge the driver into Lynx PCS which it is a
        variant of

   - CAN: Kvaser PCIEcan:
      - support packet timestamping

   - WiFi:
      - Intel (iwlwifi):
         - major update for new firmware and Multi-Link Operation (MLO)
         - configuration rework to drop test devices and split the
           different families
         - support for segmented PNVM images and power tables
         - new vendor entries for PPAG (platform antenna gain) feature
      - Qualcomm 802.11ax (ath11k):
         - Multiple Basic Service Set Identifier (MBSSID) and Enhanced
           MBSSID Advertisement (EMA) support in AP mode
         - support factory test mode
      - RealTek (rtw89):
         - add RSSI based antenna diversity
         - support U-NII-4 channels on 5 GHz band
      - RealTek (rtl8xxxu):
         - AP mode support for 8188f
         - support USB RX aggregation for the newer chips"

* tag 'net-next-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1602 commits)
  net: scm: introduce and use scm_recv_unix helper
  af_unix: Skip SCM_PIDFD if scm->pid is NULL.
  net: lan743x: Simplify comparison
  netlink: Add __sock_i_ino() for __netlink_diag_dump().
  net: dsa: avoid suspicious RCU usage for synced VLAN-aware MAC addresses
  Revert "af_unix: Call scm_recv() only after scm_set_cred()."
  phylink: ReST-ify the phylink_pcs_neg_mode() kdoc
  libceph: Partially revert changes to support MSG_SPLICE_PAGES
  net: phy: mscc: fix packet loss due to RGMII delays
  net: mana: use vmalloc_array and vcalloc
  net: enetc: use vmalloc_array and vcalloc
  ionic: use vmalloc_array and vcalloc
  pds_core: use vmalloc_array and vcalloc
  gve: use vmalloc_array and vcalloc
  octeon_ep: use vmalloc_array and vcalloc
  net: usb: qmi_wwan: add u-blox 0x1312 composition
  perf trace: fix MSG_SPLICE_PAGES build error
  ipvlan: Fix return value of ipvlan_queue_xmit()
  netfilter: nf_tables: fix underflow in chain reference counter
  netfilter: nf_tables: unbind non-anonymous set if rule construction fails
  ...
2023-06-28 16:43:10 -07:00
Linus Torvalds 6e17c6de3d - Yosry Ahmed brought back some cgroup v1 stats in OOM logs.
- Yosry has also eliminated cgroup's atomic rstat flushing.
 
 - Nhat Pham adds the new cachestat() syscall.  It provides userspace
   with the ability to query pagecache status - a similar concept to
   mincore() but more powerful and with improved usability.
 
 - Mel Gorman provides more optimizations for compaction, reducing the
   prevalence of page rescanning.
 
 - Lorenzo Stoakes has done some maintanance work on the get_user_pages()
   interface.
 
 - Liam Howlett continues with cleanups and maintenance work to the maple
   tree code.  Peng Zhang also does some work on maple tree.
 
 - Johannes Weiner has done some cleanup work on the compaction code.
 
 - David Hildenbrand has contributed additional selftests for
   get_user_pages().
 
 - Thomas Gleixner has contributed some maintenance and optimization work
   for the vmalloc code.
 
 - Baolin Wang has provided some compaction cleanups,
 
 - SeongJae Park continues maintenance work on the DAMON code.
 
 - Huang Ying has done some maintenance on the swap code's usage of
   device refcounting.
 
 - Christoph Hellwig has some cleanups for the filemap/directio code.
 
 - Ryan Roberts provides two patch series which yield some
   rationalization of the kernel's access to pte entries - use the provided
   APIs rather than open-coding accesses.
 
 - Lorenzo Stoakes has some fixes to the interaction between pagecache
   and directio access to file mappings.
 
 - John Hubbard has a series of fixes to the MM selftesting code.
 
 - ZhangPeng continues the folio conversion campaign.
 
 - Hugh Dickins has been working on the pagetable handling code, mainly
   with a view to reducing the load on the mmap_lock.
 
 - Catalin Marinas has reduced the arm64 kmalloc() minimum alignment from
   128 to 8.
 
 - Domenico Cerasuolo has improved the zswap reclaim mechanism by
   reorganizing the LRU management.
 
 - Matthew Wilcox provides some fixups to make gfs2 work better with the
   buffer_head code.
 
 - Vishal Moola also has done some folio conversion work.
 
 - Matthew Wilcox has removed the remnants of the pagevec code - their
   functionality is migrated over to struct folio_batch.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZJejewAKCRDdBJ7gKXxA
 joggAPwKMfT9lvDBEUnJagY7dbDPky1cSYZdJKxxM2cApGa42gEA6Cl8HRAWqSOh
 J0qXCzqaaN8+BuEyLGDVPaXur9KirwY=
 =B7yQ
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull mm updates from Andrew Morton:

 - Yosry Ahmed brought back some cgroup v1 stats in OOM logs

 - Yosry has also eliminated cgroup's atomic rstat flushing

 - Nhat Pham adds the new cachestat() syscall. It provides userspace
   with the ability to query pagecache status - a similar concept to
   mincore() but more powerful and with improved usability

 - Mel Gorman provides more optimizations for compaction, reducing the
   prevalence of page rescanning

 - Lorenzo Stoakes has done some maintanance work on the
   get_user_pages() interface

 - Liam Howlett continues with cleanups and maintenance work to the
   maple tree code. Peng Zhang also does some work on maple tree

 - Johannes Weiner has done some cleanup work on the compaction code

 - David Hildenbrand has contributed additional selftests for
   get_user_pages()

 - Thomas Gleixner has contributed some maintenance and optimization
   work for the vmalloc code

 - Baolin Wang has provided some compaction cleanups,

 - SeongJae Park continues maintenance work on the DAMON code

 - Huang Ying has done some maintenance on the swap code's usage of
   device refcounting

 - Christoph Hellwig has some cleanups for the filemap/directio code

 - Ryan Roberts provides two patch series which yield some
   rationalization of the kernel's access to pte entries - use the
   provided APIs rather than open-coding accesses

 - Lorenzo Stoakes has some fixes to the interaction between pagecache
   and directio access to file mappings

 - John Hubbard has a series of fixes to the MM selftesting code

 - ZhangPeng continues the folio conversion campaign

 - Hugh Dickins has been working on the pagetable handling code, mainly
   with a view to reducing the load on the mmap_lock

 - Catalin Marinas has reduced the arm64 kmalloc() minimum alignment
   from 128 to 8

 - Domenico Cerasuolo has improved the zswap reclaim mechanism by
   reorganizing the LRU management

 - Matthew Wilcox provides some fixups to make gfs2 work better with the
   buffer_head code

 - Vishal Moola also has done some folio conversion work

 - Matthew Wilcox has removed the remnants of the pagevec code - their
   functionality is migrated over to struct folio_batch

* tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (380 commits)
  mm/hugetlb: remove hugetlb_set_page_subpool()
  mm: nommu: correct the range of mmap_sem_read_lock in task_mem()
  hugetlb: revert use of page_cache_next_miss()
  Revert "page cache: fix page_cache_next/prev_miss off by one"
  mm/vmscan: fix root proactive reclaim unthrottling unbalanced node
  mm: memcg: rename and document global_reclaim()
  mm: kill [add|del]_page_to_lru_list()
  mm: compaction: convert to use a folio in isolate_migratepages_block()
  mm: zswap: fix double invalidate with exclusive loads
  mm: remove unnecessary pagevec includes
  mm: remove references to pagevec
  mm: rename invalidate_mapping_pagevec to mapping_try_invalidate
  mm: remove struct pagevec
  net: convert sunrpc from pagevec to folio_batch
  i915: convert i915_gpu_error to use a folio_batch
  pagevec: rename fbatch_count()
  mm: remove check_move_unevictable_pages()
  drm: convert drm_gem_put_pages() to use a folio_batch
  i915: convert shmem_sg_free_table() to use a folio_batch
  scatterlist: add sg_set_folio()
  ...
2023-06-28 10:28:11 -07:00
Jens Axboe dfbe5561ae io_uring: flush offloaded and delayed task_work on exit
io_uring offloads task_work for cancelation purposes when the task is
exiting. This is conceptually fine, but we should be nicer and actually
wait for that work to complete before returning.

Add an argument to io_fallback_tw() telling it to flush the deferred
work when it's all queued up, and have it flush a ctx behind whenever
the ctx changes.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-28 11:06:05 -06:00
Jens Axboe 10e1c0d590 io_uring: remove io_fallback_tw() forward declaration
It's used just one function higher up, get rid of the declaration and
just move it up a bit.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-27 16:07:24 -06:00
Jens Axboe b65db9211e io_uring/net: use proper value for msg_inq
struct msghdr->msg_inq is a signed type, yet we attempt to store what
is essentially an unsigned bitmask in there. We only really need to know
if the field was stored or not, but let's use the proper type to avoid
any misunderstandings on what is being attempted here.

Link: https://lore.kernel.org/io-uring/CAHk-=wjKb24aSe6fE4zDH-eh8hr-FB9BbukObUVSMGOrsBHCRQ@mail.gmail.com/
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-27 16:07:24 -06:00
Linus Torvalds 0aa69d53ac for-6.5/io_uring-2023-06-23
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmSV8cEQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpnvZD/0QWstFCe1CSLWaycdC9fhWftFt3hyEIST5
 CYEL56UZrDWNkv9xTLe855xvMavjd4sdHlUa8NUPghRQeJyKYgRxHBLXRWmy0uNN
 l47Zjiwsolmbr3Nt6qViLdCDYmG39ZGNwWo8b6p3ybWYLtzxeOblocOBTPzoCtkS
 hjo7Z0eMONvsvLX+l0o9IDdWtZIQ2fGo4VYkMIVb6CyxRPpuUuPKbE25qaTx+uBg
 Fy6Qa3SlaTwzqcg3dttggjIP792L/eETUCWGndg5pJNbrkj/fI4Vm4bEljID76eS
 HODl+pWHmyM6avVkypr7N3Tp5HKF0OTUa4vJTLIZo1QiRu6zlXphtuvGn+McEmgV
 hbYmQMYWzqJ22k2iEpCR58pdhmZJC9uB8r4Rwgr/t9GKqt4E+15EzmqkG9cUVMGV
 rfbBwVLwBUd5+0WHwQ8RzdtaUPt17vSIW/8WhU5zoMGVotqVBHO/H+5BtmKPWWpq
 fx1etQ8XJVPIxziJvgsEitb1s6KZzJspcONDlLEitmZkflv3gGdVm99KNbXwJpcp
 m6+FcYQ5d5FivfLPGgpx8go+4M2QuoW2yRGwZHu54buCnpxgNjIk898OjrUrdXCg
 3/0m99GXmOWQQl0VrrTr+Fv99nVsQ2hMQzOFJGMYRtHEEc5xiTcJiZmoxmF7T7/C
 TipyW3czsw==
 =5Me8
 -----END PGP SIGNATURE-----

Merge tag 'for-6.5/io_uring-2023-06-23' of git://git.kernel.dk/linux

Pull io_uring updates from Jens Axboe:
 "Nothing major in this release, just a bunch of cleanups and some
  optimizations around networking mostly.

   - clean up file request flags handling (Christoph)

   - clean up request freeing and CQ locking (Pavel)

   - support for using pre-registering the io_uring fd at setup time
     (Josh)

   - Add support for user allocated ring memory, rather than having the
     kernel allocate it. Mostly for packing rings into a huge page (me)

   - avoid an unnecessary double retry on receive (me)

   - maintain ordering for task_work, which also improves performance
     (me)

   - misc cleanups/fixes (Pavel, me)"

* tag 'for-6.5/io_uring-2023-06-23' of git://git.kernel.dk/linux: (39 commits)
  io_uring: merge conditional unlock flush helpers
  io_uring: make io_cq_unlock_post static
  io_uring: inline __io_cq_unlock
  io_uring: fix acquire/release annotations
  io_uring: kill io_cq_unlock()
  io_uring: remove IOU_F_TWQ_FORCE_NORMAL
  io_uring: don't batch task put on reqs free
  io_uring: move io_clean_op()
  io_uring: inline io_dismantle_req()
  io_uring: remove io_free_req_tw
  io_uring: open code io_put_req_find_next
  io_uring: add helpers to decode the fixed file file_ptr
  io_uring: use io_file_from_index in io_msg_grab_file
  io_uring: use io_file_from_index in __io_sync_cancel
  io_uring: return REQ_F_ flags from io_file_get_flags
  io_uring: remove io_req_ffs_set
  io_uring: remove a confusing comment above io_file_get_flags
  io_uring: remove the mode variable in io_file_get_flags
  io_uring: remove __io_file_supports_nowait
  io_uring: wait interruptibly for request completions on exit
  ...
2023-06-26 12:30:26 -07:00
Pavel Begunkov c98c81a4ac io_uring: merge conditional unlock flush helpers
There is no reason not to use __io_cq_unlock_post_flush for intermediate
aux CQE flushing, all ->task_complete should apply there, i.e. if set it
should be the submitter task. Combine them, get rid of of
__io_cq_unlock_post() and rename the left function.

This place was also taking a couple percents of CPU according to
profiles for max throughput net benchmarks due to multishot recv
flooding it with completions.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/bbed60734cbec2e833d9c7bdcf9741aada5d8aab.1687518903.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-23 08:19:40 -06:00
Pavel Begunkov 0fdb9a196c io_uring: make io_cq_unlock_post static
io_cq_unlock_post() is exclusively used in io_uring/io_uring.c, mark it
static and don't expose to other files.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3dc8127dda4514e1dd24bb32035faac887c5fa37.1687518903.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-23 08:19:40 -06:00
Pavel Begunkov ff12617728 io_uring: inline __io_cq_unlock
__io_cq_unlock is not very helpful, and users should be calling flush
variants anyway. Open code the function.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d875c4cfb69f38ccecb58a57111446c77a614caa.1687518903.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-23 08:19:40 -06:00
Pavel Begunkov 55b6a69fed io_uring: fix acquire/release annotations
We do conditional locking, so __io_cq_lock() and friends not always
actually grab/release the lock, so kill misleading annotations.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2a098f9144c24cab622f8bf90b39f44da5d0401e.1687518903.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-23 08:19:40 -06:00
Pavel Begunkov f432b76bcc io_uring: kill io_cq_unlock()
We're abusing ->completion_lock helpers. io_cq_unlock() neither
locking conditionally nor doing CQE flushing, which means that callers
must have some side reason of taking the lock and should do it directly.

Open code io_cq_unlock() into io_cqring_overflow_kill() and clean it up.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7dabb36856db2b562e78780480396c52c29b2bf4.1687518903.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-23 08:19:39 -06:00
Pavel Begunkov 91c7884ac9 io_uring: remove IOU_F_TWQ_FORCE_NORMAL
Extract a function for non-local task_work_add, and use it directly from
io_move_task_work_from_local(). Now we don't use IOU_F_TWQ_FORCE_NORMAL
and it can be killed.

As a small positive side effect we don't grab task->io_uring in
io_req_normal_work_add anymore, which is not needed for
io_req_local_work_add().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2e55571e8ff2927ae3cc12da606d204e2485525b.1687518903.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-23 08:19:39 -06:00
Pavel Begunkov 2fdd6fb5ff io_uring: don't batch task put on reqs free
We're trying to batch io_put_task() in io_free_batch_list(), but
considering that the hot path is a simple inc, it's most cerainly and
probably faster to just do io_put_task() instead of task tracking.

We don't care about io_put_task_remote() as it's only for IOPOLL
where polling/waiting is done by not the submitter task.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4a7ef7dce845fe2bd35507bf389d6bd2d5c1edf0.1687518903.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-23 08:19:39 -06:00
Pavel Begunkov 5a754dea27 io_uring: move io_clean_op()
Move io_clean_op() up in the source file and remove the forward
declaration, as the function doesn't have tricky dependencies
anymore.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1b7163b2ba7c3a8322d972c79c1b0a9301b3057e.1687518903.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-23 08:19:39 -06:00
Pavel Begunkov 3b7a612fd0 io_uring: inline io_dismantle_req()
io_dismantle_req() is only used in __io_req_complete_post(), open code
it there.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ba8f20cb2c914eefa2e7d120a104a198552050db.1687518903.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-23 08:19:39 -06:00
Pavel Begunkov 6ec9afc7f4 io_uring: remove io_free_req_tw
Request completion is a very hot path in general, but there are 3 places
that can be doing it: io_free_batch_list(), io_req_complete_post() and
io_free_req_tw().

io_free_req_tw() is used rather marginally and we don't care about it.
Killing it can help to clean up and optimise the left two, do that by
replacing it with io_req_task_complete().

There are two things to consider:
1) io_free_req() is called when all refs are put, so we need to reinit
   references. The easiest way to do that is to clear REQ_F_REFCOUNT.
2) We also don't need a cqe from it, so silence it with REQ_F_CQE_SKIP.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/434a2be8f33d474ad888ce1c17fe5ea7bbcb2a55.1687518903.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-23 08:19:39 -06:00