linux-stable

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git synced 2024-09-13 14:14:37 +00:00

Author	SHA1	Message	Date
Jens Axboe	14a1143b68	io_uring: add support for IORING_OP_UNLINKAT IORING_OP_UNLINKAT behaves like unlinkat(2) and takes the same flags and arguments. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:03:59 -07:00
Jens Axboe	80a261fd00	io_uring: add support for IORING_OP_RENAMEAT IORING_OP_RENAMEAT behaves like renameat2(), and takes the same flags etc. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:03:59 -07:00
Jens Axboe	14587a4664	io_uring: enable file table usage for SQPOLL rings Now that SQPOLL supports non-registered files and grabs the file table, we can relax the restriction on open/close/accept/connect and allow them on a ring that is setup with IORING_SETUP_SQPOLL. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:03:59 -07:00
Jens Axboe	28cea78af4	io_uring: allow non-fixed files with SQPOLL The restriction of needing fixed files for SQPOLL is problematic, and prevents/inhibits several valid uses cases. With the referenced files_struct that we have now, it's trivially supportable. Treat ->files like we do the mm for the SQPOLL thread - grab a reference to it (and assign it), and drop it when we're done. This feature is exposed as IORING_FEAT_SQPOLL_NONFIXED. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:03:54 -07:00
Hillf Danton	f26c08b444	io_uring: fix file leak on error path of io ctx creation Put file as part of error handling when setting up io ctx to fix memory leaks like the following one. BUG: memory leak unreferenced object 0xffff888101ea2200 (size 256): comm "syz-executor355", pid 8470, jiffies 4294953658 (age 32.400s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 20 59 03 01 81 88 ff ff 80 87 a8 10 81 88 ff ff Y.............. backtrace: [<000000002e0a7c5f>] kmem_cache_zalloc include/linux/slab.h:654 [inline] [<000000002e0a7c5f>] __alloc_file+0x1f/0x130 fs/file_table.c:101 [<000000001a55b73a>] alloc_empty_file+0x69/0x120 fs/file_table.c:151 [<00000000fb22349e>] alloc_file+0x33/0x1b0 fs/file_table.c:193 [<000000006e1465bb>] alloc_file_pseudo+0xb2/0x140 fs/file_table.c:233 [<000000007118092a>] anon_inode_getfile fs/anon_inodes.c:91 [inline] [<000000007118092a>] anon_inode_getfile+0xaa/0x120 fs/anon_inodes.c:74 [<000000002ae99012>] io_uring_get_fd fs/io_uring.c:9198 [inline] [<000000002ae99012>] io_uring_create fs/io_uring.c:9377 [inline] [<000000002ae99012>] io_uring_setup+0x1125/0x1630 fs/io_uring.c:9411 [<000000008280baad>] do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 [<00000000685d8cf0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 Reported-by: syzbot+71c4697e27c99fddcf17@syzkaller.appspotmail.com Fixes: `0f2122045b` ("io_uring: don't rely on weak ->files references") Cc: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Hillf Danton <hdanton@sina.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-08 08:54:26 -07:00
Pavel Begunkov	e8c954df23	io_uring: fix mis-seting personality's creds After io_identity_cow() copies an work.identity it wants to copy creds to the new just allocated id, not the old one. Otherwise it's akin to req->work.identity->creds = req->work.identity->creds. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-07 08:43:44 -07:00
Florent Revest	dba4a9256b	net: Remove the err argument from sock_from_file Currently, the sock_from_file prototype takes an "err" pointer that is either not set or set to -ENOTSOCK IFF the returned socket is NULL. This makes the error redundant and it is ignored by a few callers. This patch simplifies the API by letting callers deduce the error based on whether the returned socket is NULL or not. Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Florent Revest <revest@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: KP Singh <kpsingh@google.com> Link: https://lore.kernel.org/bpf/20201204113609.1850150-1-revest@google.com	2020-12-04 22:32:40 +01:00
Christoph Hellwig	4e7b5671c6	block: remove i_bdev Switch the block device lookup interfaces to directly work with a dev_t so that struct block_device references are only acquired by the blkdev_get variants (and the blk-cgroup special case). This means that we now don't need an extra reference in the inode and can generally simplify handling of struct block_device to keep the lookups contained in the core block layer code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Coly Li <colyli@suse.de> [bcache] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-01 14:53:39 -07:00
Pavel Begunkov	2d280bc893	io_uring: fix recvmsg setup with compat buf-select __io_compat_recvmsg_copy_hdr() with REQ_F_BUFFER_SELECT reads out iov len but never assigns it to iov/fast_iov, leaving sr->len with garbage. Hopefully, following io_buffer_select() truncates it to the selected buffer size, but the value is still may be under what was specified. Cc: <stable@vger.kernel.org> # 5.7 Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-11-30 11:12:03 -07:00
Pavel Begunkov	af60470347	io_uring: fix files grab/cancel race When one task is in io_uring_cancel_files() and another is doing io_prep_async_work() a race may happen. That's because after accounting a request inflight in first call to io_grab_identity() it still may fail and go to io_identity_cow(), which migh briefly keep dangling work.identity and not only. Grab files last, so io_prep_async_work() won't fail if it did get into ->inflight_list. note: the bug shouldn't exist after making io_uring_cancel_files() not poking into other tasks' requests. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-11-26 08:50:21 -07:00
Pavel Begunkov	9c3a205c5f	io_uring: fix ITER_BVEC check iov_iter::type is a bitmask that also keeps direction etc., so it shouldn't be directly compared against ITER_*. Use proper helper. Fixes: `ff6165b2d7` ("io_uring: retain iov_iter state over io_read/io_write calls") Reported-by: David Howells <dhowells@redhat.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Cc: <stable@vger.kernel.org> # 5.9 Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-11-24 07:54:30 -07:00
Joseph Qi	eb2667b343	io_uring: fix shift-out-of-bounds when round up cq size Abaci Fuzz reported a shift-out-of-bounds BUG in io_uring_create(): [ 59.598207] UBSAN: shift-out-of-bounds in ./include/linux/log2.h:57:13 [ 59.599665] shift exponent 64 is too large for 64-bit type 'long unsigned int' [ 59.601230] CPU: 0 PID: 963 Comm: a.out Not tainted 5.10.0-rc4+ #3 [ 59.602502] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 [ 59.603673] Call Trace: [ 59.604286] dump_stack+0x107/0x163 [ 59.605237] ubsan_epilogue+0xb/0x5a [ 59.606094] __ubsan_handle_shift_out_of_bounds.cold+0xb2/0x20e [ 59.607335] ? lock_downgrade+0x6c0/0x6c0 [ 59.608182] ? rcu_read_lock_sched_held+0xaf/0xe0 [ 59.609166] io_uring_create.cold+0x99/0x149 [ 59.610114] io_uring_setup+0xd6/0x140 [ 59.610975] ? io_uring_create+0x2510/0x2510 [ 59.611945] ? lockdep_hardirqs_on_prepare+0x286/0x400 [ 59.613007] ? syscall_enter_from_user_mode+0x27/0x80 [ 59.614038] ? trace_hardirqs_on+0x5b/0x180 [ 59.615056] do_syscall_64+0x2d/0x40 [ 59.615940] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 59.617007] RIP: 0033:0x7f2bb8a0b239 This is caused by roundup_pow_of_two() if the input entries larger enough, e.g. 2^32-1. For sq_entries, it will check first and we allow at most IORING_MAX_ENTRIES, so it is okay. But for cq_entries, we do round up first, that may overflow and truncate it to 0, which is not the expected behavior. So check the cq size first and then do round up. Fixes: `88ec3211e4` ("io_uring: round-up cq size before comparing with rounded sq size") Reported-by: Abaci Fuzz <abaci@linux.alibaba.com> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-11-24 07:54:30 -07:00
Jens Axboe	36f4fa6886	io_uring: add support for shutdown(2) This adds support for the shutdown(2) system call, which is useful for dealing with sockets. shutdown(2) may block, so we have to punt it to async context. Suggested-by: Norman Maurer <norman.maurer@googlemail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-11-23 09:15:15 -07:00
Jens Axboe	ce59fc69b1	io_uring: allow SQPOLL with CAP_SYS_NICE privileges CAP_SYS_ADMIN is too restrictive for a lot of uses cases, allow CAP_SYS_NICE based on the premise that such users are already allowed to raise the priority of tasks. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-11-23 09:15:15 -07:00
Linus Torvalds	fa5fca78bb	io_uring-5.10-2020-11-20 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl+4DAwQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgphdOD/9xOEnYPuekvVH9G9nyNd//Q9fPArG2+j6V /MCnze07GNtDt7z15oR+T07hKXmf+Ejh4nu3JJ6MUNfe/47hhJqHSxRHU6+PJCjk hPrsaTsDedxxLEDiLmvhXnUPzfVzJtefxVAAaKikWOb3SBqLdh7xTFSlor1HbRBl Zk4d343cjBDYfvSSt/zMWDzwwvramdz7rJnnPMKXITu64ITL5314vuK2YVZmBOet YujSah7J8FL1jKhiG1Iw5rayd2Q3smnHWIEQ+lvW6WiTvMJMLOxif2xNF4/VEZs1 CBGJUQt42LI6QGEzRBHohcefZFuPGoxnduSzHCOIhh7d6+k+y9mZfsPGohr3g9Ov NotXpVonnA7GbRqzo1+IfBRve7iRONdZ3/LBwyRmqav4I4jX68wXBNH5IDpVR0Sn c31avxa/ZL7iLIBx32enp0/r3mqNTQotEleSLUdyJQXAZTyG2INRhjLLXTqSQ5BX oVp0fZzKCwsr6HCPZpXZ/f2G7dhzuF0ghoceC02GsOVooni22gdVnQj+AWNus398 e+wcimT4MX6AHNFxO2aUtJow0KWWZRzC1p5Mxu/9W3YiMtJiC0YOGePfSqiTqX0g Uk0H5dOAgBUQrAsusf7bKr0K6W25yEk/JipxhWqi0rC71x42mLTsCT1wxSCvLwqs WxhdtVKroQ== =7PAe -----END PGP SIGNATURE----- Merge tag 'io_uring-5.10-2020-11-20' of git://git.kernel.dk/linux-block Pull io_uring fixes from Jens Axboe: "Mostly regression or stable fodder: - Disallow async path resolution of /proc/self - Tighten constraints for segmented async buffered reads - Fix double completion for a retry error case - Fix for fixed file life times (Pavel)" * tag 'io_uring-5.10-2020-11-20' of git://git.kernel.dk/linux-block: io_uring: order refnode recycling io_uring: get an active ref_node from files_data io_uring: don't double complete failed reissue request mm: never attempt async page lock if we've transferred data already io_uring: handle -EOPNOTSUPP on path resolution proc: don't allow async path resolution of /proc/self components	2020-11-20 11:47:22 -08:00
Pavel Begunkov	e297822b20	io_uring: order refnode recycling Don't recycle a refnode until we're done with all requests of nodes ejected before. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Cc: stable@vger.kernel.org # v5.7+ Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-11-18 08:02:10 -07:00
Pavel Begunkov	1e5d770bb8	io_uring: get an active ref_node from files_data An active ref_node always can be found in ctx->files_data, it's much safer to get it this way instead of poking into files_data->ref_list. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Cc: stable@vger.kernel.org # v5.7+ Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-11-18 08:02:10 -07:00
Jens Axboe	c993df5a68	io_uring: don't double complete failed reissue request Zorro reports that an xfstest test case is failing, and it turns out that for the reissue path we can potentially issue a double completion on the request for the failure path. There's an issue around the retry as well, but for now, at least just make sure that we handle the error path correctly. Cc: stable@vger.kernel.org Fixes: `b63534c41e` ("io_uring: re-issue block requests that failed because of resources") Reported-by: Zorro Lang <zlang@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-11-17 15:17:29 -07:00
Jens Axboe	944d1444d5	io_uring: handle -EOPNOTSUPP on path resolution Any attempt to do path resolution on /proc/self from an async worker will yield -EOPNOTSUPP. We can safely do that resolution from the task itself, and without blocking, so retry it from there. Ideally io_uring would know this upfront and not have to go through the worker thread to find out, but that doesn't currently seem feasible. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-11-14 10:22:30 -07:00
Linus Torvalds	f01c30de86	More VFS fixes for 5.10-rc4: - Minor cleanups of the sb_start_* fs freeze helpers. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl+sDaIACgkQ+H93GTRK tOu4sw//bIdBw11YfI9sPtMJR/RkK3lm/pU4A/eJYGD65Mzk8J4kNi6jXKuyqQ8e /RpTqKWOwVW05Qg5HlKTxXRyr5Q788+EuBQH2t8VukWVdAgK2TFvNTTXb7QDsNSD SneC7Sox3CEO+vYnBsr7tUjfl7AYH0uFTxLkvpYqSQBn2+jo2x0s7NyKKZSDAASI +Rmhinw4QjjAHYC54nBy6Q47XhrZJj7XCODJdEql81cKSJUvjCo3url3sNvGXXNW oXbs5IO5cVQrQx6n9rQxCfkN1dz9c/CBopYFwdgmg76Bj4VLSzCYVecnMeDl53pV 3jXesNtJcR2dz64e98K1Moof2dHSm0/NP0Q7KnMYEaGEl6tAtyjSx9lL2Qd6npG+ mG460UHd/7RHXoH/BTaCrtHHyA4pApHMqf+w3R2ienxrltKUJAEfGM/5x8o0ikWx laeT0L/m6Yv/dGnDvNthhoF84tCiQUnxg+UeXiKv4R9uFL1bKMFPw5i1zWuXqqaX yZPqUY1tiecQskr89AimOVI64L2MJ4DgBey1JzNL/XzPtw55Qu+LR6MkkaIC08Wu ubGJTm6fPw3Cz8JYgn4WIgKB9Q7yAoKsyl0mGLQh2SJT1FS8WLct+SRPwXcMVfJT VpkgjJW/ak5L+XfQU6Ev39zUasEAqdaxvPoTxUfne6spUiNbgrk= =ZC9a -----END PGP SIGNATURE----- Merge tag 'vfs-5.10-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux Pull fs freeze fix and cleanups from Darrick Wong: "A single vfs fix for 5.10, along with two subsequent cleanups. A very long time ago, a hack was added to the vfs fs freeze protection code to work around lockdep complaints about XFS, which would try to run a transaction (which requires intwrite protection) to finalize an xfs freeze (by which time the vfs had already taken intwrite). Fast forward a few years, and XFS fixed the recursive intwrite problem on its own, and the hack became unnecessary. Fast forward almost a decade, and latent bugs in the code converting this hack from freeze flags to freeze locks combine with lockdep bugs to make this reproduce frequently enough to notice page faults racing with freeze. Since the hack is unnecessary and causes thread race errors, just get rid of it completely. Making this kind of vfs change midway through a cycle makes me nervous, but a large enough number of the usual VFS/ext4/XFS/btrfs suspects have said this looks good and solves a real problem vector. And once that removal is done, __sb_start_write is now simple enough that it becomes possible to refactor the function into smaller, simpler static inline helpers in linux/fs.h. The cleanup is straightforward. Summary: - Finally remove the "convert to trylock" weirdness in the fs freezer code. It was necessary 10 years ago to deal with nested transactions in XFS, but we've long since removed that; and now this is causing subtle race conditions when lockdep goes offline and sb_start_* aren't prepared to retry a trylock failure. - Minor cleanups of the sb_start_* fs freeze helpers" * tag 'vfs-5.10-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: vfs: move __sb_{start,end}_write* to fs.h vfs: separate __sb_start_write into blocking and non-blocking helpers vfs: remove lockdep bogosity in __sb_start_write	2020-11-13 16:07:53 -08:00
Jens Axboe	88ec3211e4	io_uring: round-up cq size before comparing with rounded sq size If an application specifies IORING_SETUP_CQSIZE to set the CQ ring size to a specific size, we ensure that the CQ size is at least that of the SQ ring size. But in doing so, we compare the already rounded up to power of two SQ size to the as-of yet unrounded CQ size. This means that if an application passes in non power of two sizes, we can return -EINVAL when the final value would've been fine. As an example, an application passing in 100/100 for sq/cq size should end up with 128 for both. But since we round the SQ size first, we compare the CQ size of 100 to 128, and return -EINVAL as that is too small. Cc: stable@vger.kernel.org Fixes: `33a107f0a1` ("io_uring: allow application controlled CQ ring size") Reported-by: Dan Melnic <dmm@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-11-11 10:42:41 -07:00
Darrick J. Wong	8a3c84b649	vfs: separate __sb_start_write into blocking and non-blocking helpers Break this function into two helpers so that it's obvious that the trylock versions return a value that must be checked, and the blocking versions don't require that. While we're at it, clean up the return type mismatch. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de>	2020-11-10 16:53:07 -08:00
Pavel Begunkov	9a472ef7a3	io_uring: fix link lookup racing with link timeout We can't just go over linked requests because it may race with linked timeouts. Take ctx->completion_lock in that case. Cc: stable@vger.kernel.org # v5.7+ Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-11-05 15:36:40 -07:00
Jens Axboe	6b47ab81c9	io_uring: use correct pointer for io_uring_show_cred() Previous commit changed how we index the registered credentials, but neglected to update one spot that is used when the personalities are iterated through ->show_fdinfo(). Ensure we use the right struct type for the iteration. Reported-by: syzbot+a6d494688cdb797bdfce@syzkaller.appspotmail.com Fixes: `1e6fa5216a` ("io_uring: COW io_identity on mismatch") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-11-05 09:50:16 -07:00
Pavel Begunkov	ef9865a442	io_uring: don't forget to task-cancel drained reqs If there is a long-standing request of one task locking up execution of deferred requests, and the defer list contains requests of another task (all files-less), then a potential execution of __io_uring_task_cancel() by that another task will sleep until that first long-standing request completion, and that may take long. E.g. tsk1: req1/read(empty_pipe) -> tsk2: req(DRAIN) Then __io_uring_task_cancel(tsk2) waits for req1 completion. It seems we even can manufacture a complicated case with many tasks sharing many rings that can lock them forever. Cancel deferred requests for __io_uring_task_cancel() as well. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-11-05 09:15:24 -07:00
Pavel Begunkov	99b328084f	io_uring: fix overflowed cancel w/ linked ->files Current io_match_files() check in io_cqring_overflow_flush() is useless because requests drop ->files before going to the overflow list, however linked to it request do not, and we don't check them. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-11-04 10:22:57 -07:00
Jens Axboe	cb8a8ae310	io_uring: drop req/tctx io_identity separately We can't bundle this into one operation, as the identity may not have originated from the tctx to begin with. Drop one ref for each of them separately, if they don't match the static assignment. If we don't, then if the identity is a lookup from registered credentials, we could be freeing that identity as we're dropping a reference assuming it came from the tctx. syzbot reports this as a use-after-free, as the identity is still referencable from idr lookup: ================================================================== BUG: KASAN: use-after-free in instrument_atomic_read_write include/linux/instrumented.h:101 [inline] BUG: KASAN: use-after-free in atomic_fetch_add_relaxed include/asm-generic/atomic-instrumented.h:142 [inline] BUG: KASAN: use-after-free in __refcount_add include/linux/refcount.h:193 [inline] BUG: KASAN: use-after-free in __refcount_inc include/linux/refcount.h:250 [inline] BUG: KASAN: use-after-free in refcount_inc include/linux/refcount.h:267 [inline] BUG: KASAN: use-after-free in io_init_req fs/io_uring.c:6700 [inline] BUG: KASAN: use-after-free in io_submit_sqes+0x15a9/0x25f0 fs/io_uring.c:6774 Write of size 4 at addr ffff888011e08e48 by task syz-executor165/8487 CPU: 1 PID: 8487 Comm: syz-executor165 Not tainted 5.10.0-rc1-next-20201102-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x107/0x163 lib/dump_stack.c:118 print_address_description.constprop.0.cold+0xae/0x4c8 mm/kasan/report.c:385 __kasan_report mm/kasan/report.c:545 [inline] kasan_report.cold+0x1f/0x37 mm/kasan/report.c:562 check_memory_region_inline mm/kasan/generic.c:186 [inline] check_memory_region+0x13d/0x180 mm/kasan/generic.c:192 instrument_atomic_read_write include/linux/instrumented.h:101 [inline] atomic_fetch_add_relaxed include/asm-generic/atomic-instrumented.h:142 [inline] __refcount_add include/linux/refcount.h:193 [inline] __refcount_inc include/linux/refcount.h:250 [inline] refcount_inc include/linux/refcount.h:267 [inline] io_init_req fs/io_uring.c:6700 [inline] io_submit_sqes+0x15a9/0x25f0 fs/io_uring.c:6774 __do_sys_io_uring_enter+0xc8e/0x1b50 fs/io_uring.c:9159 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x440e19 Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 eb 0f fc ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007fff644ff178 EFLAGS: 00000246 ORIG_RAX: 00000000000001aa RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 0000000000440e19 RDX: 0000000000000000 RSI: 000000000000450c RDI: 0000000000000003 RBP: 0000000000000004 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00000000022b4850 R13: 0000000000000010 R14: 0000000000000000 R15: 0000000000000000 Allocated by task 8487: kasan_save_stack+0x1b/0x40 mm/kasan/common.c:48 kasan_set_track mm/kasan/common.c:56 [inline] __kasan_kmalloc.constprop.0+0xc2/0xd0 mm/kasan/common.c:461 kmalloc include/linux/slab.h:552 [inline] io_register_personality fs/io_uring.c:9638 [inline] __io_uring_register fs/io_uring.c:9874 [inline] __do_sys_io_uring_register+0x10f0/0x40a0 fs/io_uring.c:9924 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Freed by task 8487: kasan_save_stack+0x1b/0x40 mm/kasan/common.c:48 kasan_set_track+0x1c/0x30 mm/kasan/common.c:56 kasan_set_free_info+0x1b/0x30 mm/kasan/generic.c:355 __kasan_slab_free+0x102/0x140 mm/kasan/common.c:422 slab_free_hook mm/slub.c:1544 [inline] slab_free_freelist_hook+0x5d/0x150 mm/slub.c:1577 slab_free mm/slub.c:3140 [inline] kfree+0xdb/0x360 mm/slub.c:4122 io_identity_cow fs/io_uring.c:1380 [inline] io_prep_async_work+0x903/0xbc0 fs/io_uring.c:1492 io_prep_async_link fs/io_uring.c:1505 [inline] io_req_defer fs/io_uring.c:5999 [inline] io_queue_sqe+0x212/0xed0 fs/io_uring.c:6448 io_submit_sqe fs/io_uring.c:6542 [inline] io_submit_sqes+0x14f6/0x25f0 fs/io_uring.c:6784 __do_sys_io_uring_enter+0xc8e/0x1b50 fs/io_uring.c:9159 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 The buggy address belongs to the object at ffff888011e08e00 which belongs to the cache kmalloc-96 of size 96 The buggy address is located 72 bytes inside of 96-byte region [ffff888011e08e00, ffff888011e08e60) The buggy address belongs to the page: page:00000000a7104751 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x11e08 flags: 0xfff00000000200(slab) raw: 00fff00000000200 ffffea00004f8540 0000001f00000002 ffff888010041780 raw: 0000000000000000 0000000080200020 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff888011e08d00: 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc ffff888011e08d80: 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc > ffff888011e08e00: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc ^ ffff888011e08e80: 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc ffff888011e08f00: 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc ================================================================== Reported-by: syzbot+625ce3bb7835b63f7f3d@syzkaller.appspotmail.com Fixes: `1e6fa5216a` ("io_uring: COW io_identity on mismatch") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-11-04 10:22:57 -07:00
Jens Axboe	4b70cf9dea	io_uring: ensure consistent view of original task ->mm from SQPOLL Ensure we get a valid view of the task mm, by using task_lock() when attempting to grab the original task mm. Reported-by: syzbot+b57abf7ee60829090495@syzkaller.appspotmail.com Fixes: `2aede0e417` ("io_uring: stash ctx task reference for SQPOLL") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-11-04 10:22:57 -07:00
Jens Axboe	fdaf083cdf	io_uring: properly handle SQPOLL request cancelations Track if a given task io_uring context contains SQPOLL instances, so we can iterate those for cancelation (and request counts). This ensures that we properly wait on SQPOLL contexts, and find everything that needs canceling. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-11-04 10:22:56 -07:00
Linus Torvalds	cf9446cc8e	io_uring-5.10-2020-10-30 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl+cRyAQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpiisD/9qmkOK7zfdh6HWyMAKm4m2GHMlhZy56VQ0 MklbKcYblfg69u1lmvcDv5/9l2h3ESxCMDYQbl/yuQ0MepK0PrDyndN3hVg8y8VW tRP6rHvOVBLH/R8C1ClfWJ2gVxrH776GOugV3q7wY8uD+caNug12kjV3YFVwychD akSoSzpCkN5BFfMkWgapcnvQD+SR5lPJeojru9kH94BIUC9zOCgkMVlZ1TAue8B4 VNHP5ghv/t4SWzmKiuLnboGUP6NVk9EPBPmVFNklfdr6kDpkKGRofVnS54/dcRRG JHpP0dvAVSjpKztW2f1fFeG/0OIRYuLuMS5SERrgIacIPVuz21i5VKpNYP7wKb24 oarxRtMBsOmkejfSPiSlGlQkcfB1j6K/13a+xIFkczT62SdO2wPcg/4BFuQx+yq0 Pw8gSXQ3QltcfsojojjQ61cnT1p0mSS7uObcgT6wVQQ8rFQaqSaZLhXFCvrb3731 28py3baghl0IrvFDaBjbJFbetGBhuaMxoBrr3B3sZsF5UMVHXUYgweJB+gGADE3s SlYaYHxgiraPSpl6F8zLse1WGPISRjchTArRcntgYlEXIlFrqWGNKOOIBD6y7OZe 3ARvPaUZsmi6oZ5SlEqTmAsSqZDo0UzyWzpB2yDBLY90Re/b2lwzhapgI4WbqX+W Bngw2TwZFg== =xYFz -----END PGP SIGNATURE----- Merge tag 'io_uring-5.10-2020-10-30' of git://git.kernel.dk/linux-block Pull io_uring fixes from Jens Axboe: - Fixes for linked timeouts (Pavel) - Set IO_WQ_WORK_CONCURRENT early for async offload (Pavel) - Two minor simplifications that make the code easier to read and follow (Pavel) * tag 'io_uring-5.10-2020-10-30' of git://git.kernel.dk/linux-block: io_uring: use type appropriate io_kiocb handler for double poll io_uring: simplify __io_queue_sqe() io_uring: simplify nxt propagation in io_queue_sqe io_uring: don't miss setting IO_WQ_WORK_CONCURRENT io_uring: don't defer put of cancelled ltimeout io_uring: always clear LINK_TIMEOUT after cancel io_uring: don't adjust LINK_HEAD in cancel ltimeout io_uring: remove opcode check on ltimeout kill	2020-10-30 14:55:36 -07:00
Jens Axboe	c8b5e2600a	io_uring: use type appropriate io_kiocb handler for double poll io_poll_double_wake() is called for both request types - both pure poll requests, and internal polls. This means that we should be using the right handler based on the request type. Use the one that the original caller already assigned for the waitqueue handling, that will always match the correct type. Cc: stable@vger.kernel.org # v5.8+ Reported-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-25 13:53:26 -06:00
Linus Torvalds	af0041875c	io_uring-5.10-2020-10-24 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl+UQh8QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpl7WEADOTslFOof1RUPMb0Qvj4GO4cjvoFLW7KLt B83PmlW3WJpZrSiqZlrSPwcDELVphw67RL/2hp0jAfT1t00OdCOYQDmh7+kg9lnI fzu4NzfTKbriRWEtodIqZCiDoGXjzJGxNffhxPEt33YxRErI/fvuD/TzxwGGUInW OZ3Aze9Nj2DQ/eXhio48n4letTK6xNsjGDWvzwinthHWeBbID01isLlTei20PKU5 Dk1buueUuEr/vNjJwEeRd8yDXZeLZ/br3gw/3B71MJoi2PUaXvuS8DV4LmXg2SS5 yN0udSNk4AP/UlrVqN9bEqdbSTBSf2JIEW3k3/SEUjcjw6hMnbLeoW2vZx6Xvk6T vvAVHesLpCu8oEdWAkFm6Rb6ptJ1XpRrWWYxi1J1SB2Y8cGyGS1GoZWWPknM5M3I b1dNj18Bb+MmFvuKr7YYrb77tECuywxTHVGj6WwBOIlYrg44XQOumYYH9OmvZFz1 6vWaXjLPOIM8fpAKX5Tx5sAy/FMl17H8I5AD2bZVvD0h0MqzLnvHEYahcAfOfb9y qpkdGnbAWo6IIkCrDcSOV4q6dmWu3as9eSs1j/6Xl4WoJ2MT9C//Gpv7iNMxxozy CznEPcbA8N9QazQmoebtB3gTBVyGUUKVDdVNzleMj9KD6yPlKFZ6+FZdikX59I9M t9QGh3+gow== =xidc -----END PGP SIGNATURE----- Merge tag 'io_uring-5.10-2020-10-24' of git://git.kernel.dk/linux-block Pull io_uring fixes from Jens Axboe: - fsize was missed in previous unification of work flags - Few fixes cleaning up the flags unification creds cases (Pavel) - Fix NUMA affinities for completely unplugged/replugged node for io-wq - Two fallout fixes from the set_fs changes. One local to io_uring, one for the splice entry point that io_uring uses. - Linked timeout fixes (Pavel) - Removal of ->flush() ->files work-around that we don't need anymore with referenced files (Pavel) - Various cleanups (Pavel) * tag 'io_uring-5.10-2020-10-24' of git://git.kernel.dk/linux-block: splice: change exported internal do_splice() helper to take kernel offset io_uring: make loop_rw_iter() use original user supplied pointers io_uring: remove req cancel in ->flush() io-wq: re-set NUMA node affinities if CPUs come online io_uring: don't reuse linked_timeout io_uring: unify fsize with def->work_flags io_uring: fix racy REQ_F_LINK_TIMEOUT clearing io_uring: do poll's hash_node init in common code io_uring: inline io_poll_task_handler() io_uring: remove extra ->file check in poll prep io_uring: make cached_cq_overflow non atomic_t io_uring: inline io_fail_links() io_uring: kill ref get/drop in personality init io_uring: flags-based creds init in queue	2020-10-24 12:40:18 -07:00
Pavel Begunkov	0d63c148d6	io_uring: simplify __io_queue_sqe() Restructure __io_queue_sqe() so it follows simple if/else if/else control flow. It's more readable and removes extra goto/labels. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-23 13:07:12 -06:00
Pavel Begunkov	9aaf354352	io_uring: simplify nxt propagation in io_queue_sqe Don't overuse goto's, complex control flow doesn't make compilers happy and makes code harder to read. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-23 13:07:12 -06:00
Pavel Begunkov	feaadc4fc2	io_uring: don't miss setting IO_WQ_WORK_CONCURRENT Set IO_WQ_WORK_CONCURRENT for all REQ_F_FORCE_ASYNC requests, do that in that is also looks better. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-23 13:07:11 -06:00
Pavel Begunkov	c9abd7ad83	io_uring: don't defer put of cancelled ltimeout Inline io_link_cancel_timeout() and __io_kill_linked_timeout() into io_kill_linked_timeout(). That allows to easily move a put of a cancelled linked timeout out of completion_lock and to not deferring it. It is also much more readable when not scattered across three different functions. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-23 13:07:11 -06:00
Pavel Begunkov	cdfcc3ee04	io_uring: always clear LINK_TIMEOUT after cancel Move REQ_F_LINK_TIMEOUT clearing out of __io_kill_linked_timeout() because it might return early and leave the flag set. It's not a problem, but may be confusing. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-23 13:07:11 -06:00
Pavel Begunkov	ac877d2edd	io_uring: don't adjust LINK_HEAD in cancel ltimeout An armed linked timeout can never be a head of a link, so we don't need to clear REQ_F_LINK_HEAD for it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-23 13:07:11 -06:00
Pavel Begunkov	e08102d507	io_uring: remove opcode check on ltimeout kill __io_kill_linked_timeout() already checks for REQ_F_LTIMEOUT_ACTIVE and it's set only for linked timeouts. No need to verify next request's opcode. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-23 13:07:11 -06:00
Linus Torvalds	4a22709e21	arch-cleanup-2020-10-22 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl+SOXIQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgptrcD/93VUDmRAn73ChKNd0TtXUicJlAlNLVjvfs VFTXWBDnlJnGkZT7ElkDD9b8dsz8l4xGf/QZ5dzhC/th2OsfObQkSTfe0lv5cCQO mX7CRSrDpjaHtW+WGPDa0oQsGgIfpqUz2IOg9NKbZZ1LJ2uzYfdOcf3oyRgwZJ9B I3sh1vP6OzjZVVCMmtMTM+sYZEsDoNwhZwpkpiwMmj8tYtOPgKCYKpqCiXrGU0x2 ML5FtDIwiwU+O3zYYdCBWqvCb2Db0iA9Aov2whEBz/V2jnmrN5RMA/90UOh1E2zG br4wM1Wt3hNrtj5qSxZGlF/HEMYJVB8Z2SgMjYu4vQz09qRVVqpGdT/dNvLAHQWg w4xNCj071kVZDQdfwnqeWSKYUau9Xskvi8xhTT+WX8a5CsbVrM9vGslnS5XNeZ6p h2D3Q+TAYTvT756icTl0qsYVP7PrPY7DdmQYu0q+Lc3jdGI+jyxO2h9OFBRLZ3p6 zFX2N8wkvvCCzP2DwVnnhIi/GovpSh7ksHnb039F36Y/IhZPqV1bGqdNQVdanv6I 8fcIDM6ltRQ7dO2Br5f1tKUZE9Pm6x60b/uRVjhfVh65uTEKyGRhcm5j9ztzvQfI cCBg4rbVRNKolxuDEkjsAFXVoiiEEsb7pLf4pMO+Dr62wxFG589tQNySySneUIVZ J9ILnGAAeQ== =aVWo -----END PGP SIGNATURE----- Merge tag 'arch-cleanup-2020-10-22' of git://git.kernel.dk/linux-block Pull arch task_work cleanups from Jens Axboe: "Two cleanups that don't fit other categories: - Finally get the task_work_add() cleanup done properly, so we don't have random 0/1/false/true/TWA_SIGNAL confusing use cases. Updates all callers, and also fixes up the documentation for task_work_add(). - While working on some TIF related changes for 5.11, this TIF_NOTIFY_RESUME cleanup fell out of that. Remove some arch duplication for how that is handled" * tag 'arch-cleanup-2020-10-22' of git://git.kernel.dk/linux-block: task_work: cleanup notification modes tracehook: clear TIF_NOTIFY_RESUME in tracehook_notify_resume()	2020-10-23 10:06:38 -07:00
Jens Axboe	4017eb91a9	io_uring: make loop_rw_iter() use original user supplied pointers We jump through a hoop for fixed buffers, where we first map these to a bvec(), then kmap() the bvec to obtain the pointer we copy to/from. This was always a bit ugly, and with the set_fs changes, it ends up being practically problematic as well. There's no need to jump through these hoops, just use the original user pointers and length for the non iter based read/write. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-22 14:14:12 -06:00
Pavel Begunkov	c8fb20b5b4	io_uring: remove req cancel in ->flush() Every close(io_uring) causes cancellation of all inflight requests carrying ->files. That's not nice but was neccessary up until recently. Now task->files removal is handled in the core code, so that part of flush can be removed. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-22 09:54:19 -06:00
Pavel Begunkov	ff5771613c	io_uring: don't reuse linked_timeout Clear linked_timeout for next requests in __io_queue_sqe() so we won't queue it up unnecessary when it's going to be punted. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Cc: stable@vger.kernel.org # v5.9 Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-21 16:37:56 -06:00
Jens Axboe	69228338c9	io_uring: unify fsize with def->work_flags This one was missed in the earlier conversion, should be included like any of the other IO identity flags. Make sure we restore to RLIM_INIFITY when dropping the personality again. Fixes: `98447d65b4` ("io_uring: move io identity items into separate struct") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-20 16:03:13 -06:00
Linus Torvalds	4962a85696	io_uring-5.10-2020-10-20 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl+O9WEQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpqajEADSD5PiO94YWTtVNWFUjn5RW+GlCE70/0VV XImdasmvM8nb48E+z2EW0Ky4vKXeVy5r+WZAeIYqPUHy/ogQDVpEn00NL7tFQmOz 8UYrlZ3LLE/bSeWM5iavgG7TldVs/ZfspJ0hj3/Ac7jJpzuRGEI5TClsxJ0mWV39 b2qT4OYBDhwvwVPZ/qhWgEwJXJFzFywckouIbMw8gPkveebUYeyu/yuScNGwYuiQ 46YPEk/XIuOy8iUvQjqhLY+NNlAKJwt3z9WZgt5F+TZIhkpp7z6h20+jezFQcuFP GXzIDN+EADpsbw7MWJYIZVffxDEMlDpkJlAVMT1hsYLDfTXoEzmFwRddoFh2Fjf6 ghWqhOKffUuAOX2xs1MrS2xLaxd0ot7QqZJVTYk7zEljkaRANlstSZZ+PpI+Sad/ rNieQvs6jnsmTODDEaV3qyFX5aBQ2NdvyndZNU9wz0GZAWAdz+wxE0A1FVD0A37i p6m8sIvhNg3/cW89G04IDYUkAygi8knVDnEDHRwaJtswZQ4pRSGMp+N4qZ0GpnK7 BviaAhofGaYlqruavO6Ug2YyomYpWGlUxTaB9ZKh0HkEDlDM945+0sgQRdxfsE8d OboycqJn3puOl/wh5Fc4oGYrWLsDbaA/5kksC4lm85Z+HUf+UXMS4QFdoPJYjhuM H6oMz1w2bQ== =v56S -----END PGP SIGNATURE----- Merge tag 'io_uring-5.10-2020-10-20' of git://git.kernel.dk/linux-block Pull io_uring updates from Jens Axboe: "A mix of fixes and a few stragglers. In detail: - Revert the bogus __read_mostly that we discussed for the initial pull request. - Fix a merge window regression with fixed file registration error path handling. - Fix io-wq numa node affinities. - Series abstracting out an io_identity struct, making it both easier to see what the personality items are, and also easier to to adopt more. Use this to cover audit logging. - Fix for read-ahead disabled block condition in async buffered reads, and using single page read-ahead to unify what generic_file_buffer_read() path is used. - Series for REQ_F_COMP_LOCKED fix and removal of it (Pavel) - Poll fix (Pavel)" * tag 'io_uring-5.10-2020-10-20' of git://git.kernel.dk/linux-block: (21 commits) io_uring: use blk_queue_nowait() to check if NOWAIT supported mm: use limited read-ahead to satisfy read mm: mark async iocb read as NOWAIT once some data has been copied io_uring: fix double poll mask init io-wq: inherit audit loginuid and sessionid io_uring: use percpu counters to track inflight requests io_uring: assign new io_identity for task if members have changed io_uring: store io_identity in io_uring_task io_uring: COW io_identity on mismatch io_uring: move io identity items into separate struct io_uring: rely solely on work flags to determine personality. io_uring: pass required context in as flags io-wq: assign NUMA node locality if appropriate io_uring: fix error path cleanup in io_sqe_files_register() Revert "io_uring: mark io_uring_fops/io_op_defs as __read_mostly" io_uring: fix REQ_F_COMP_LOCKED by killing it io_uring: dig out COMP_LOCK from deep call chain io_uring: don't put a poll req under spinlock io_uring: don't unnecessarily clear F_LINK_TIMEOUT io_uring: don't set COMP_LOCKED if won't put ...	2020-10-20 13:19:30 -07:00
Pavel Begunkov	900fad45dc	io_uring: fix racy REQ_F_LINK_TIMEOUT clearing io_link_timeout_fn() removes REQ_F_LINK_TIMEOUT from the link head's flags, it's not atomic and may race with what the head is doing. If io_link_timeout_fn() doesn't clear the flag, as forced by this patch, then it may happen that for "req -> link_timeout1 -> link_timeout2", __io_kill_linked_timeout() would find link_timeout2 and try to cancel it, so miscounting references. Teach it to ignore such double timeouts by marking the active one with a new flag in io_prep_linked_timeout(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-19 13:29:51 -06:00
Pavel Begunkov	4d52f33899	io_uring: do poll's hash_node init in common code Move INIT_HLIST_NODE(&req->hash_node) into __io_arm_poll_handler(), so that it doesn't duplicated and common poll code would be responsible for it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-19 13:29:29 -06:00
Pavel Begunkov	dd221f46f6	io_uring: inline io_poll_task_handler() io_poll_task_handler() doesn't add clarity, inline it in its only user. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-19 13:29:29 -06:00
Pavel Begunkov	069b89384d	io_uring: remove extra ->file check in poll prep io_poll_add_prep() doesn't need to verify ->file because it's already done in io_init_req(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-19 13:29:29 -06:00
Pavel Begunkov	2c3bac6dd6	io_uring: make cached_cq_overflow non atomic_t ctx->cached_cq_overflow is changed only under completion_lock. Convert it from atomic_t to just int, and mark all places when it's read without lock with READ_ONCE, which guarantees atomicity (relaxed ordering). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-19 13:29:29 -06:00
Pavel Begunkov	d148ca4b07	io_uring: inline io_fail_links() Inline io_fail_links() and kill extra io_cqring_ev_posted(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-19 13:29:29 -06:00
Pavel Begunkov	ec99ca6c47	io_uring: kill ref get/drop in personality init Don't take an identity on personality/creds init only to drop it a few lines after. Extract a function which prepares req->work but leaves it without identity. Note: it's safe to not check REQ_F_WORK_INITIALIZED there because it's nobody had a chance to init it before io_init_req(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-19 13:29:29 -06:00
Pavel Begunkov	2e5aa6cb4d	io_uring: flags-based creds init in queue Use IO_WQ_WORK_CREDS to figure out if req has creds to be used. Since recently it should rely only on flags, but not value of work.creds. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-19 13:29:29 -06:00
Jeffle Xu	9ba0d0c812	io_uring: use blk_queue_nowait() to check if NOWAIT supported commit `021a24460d` ("block: add QUEUE_FLAG_NOWAIT") adds a new helper function blk_queue_nowait() to check if the bdev supports handling of REQ_NOWAIT or not. Since then bio-based dm device can also support REQ_NOWAIT, and currently only dm-linear supports that since commit `6abc49468e` ("dm: add support for REQ_NOWAIT and enable it for linear target"). Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-19 07:32:36 -06:00
Minchan Kim	0726b01e70	mm/madvise: pass mm to do_madvise Patch series "introduce memory hinting API for external process", v9. Now, we have MADV_PAGEOUT and MADV_COLD as madvise hinting API. With that, application could give hints to kernel what memory range are preferred to be reclaimed. However, in some platform(e.g., Android), the information required to make the hinting decision is not known to the app. Instead, it is known to a centralized userspace daemon(e.g., ActivityManagerService), and that daemon must be able to initiate reclaim on its own without any app involvement. To solve the concern, this patch introduces new syscall - process_madvise(2). Bascially, it's same with madvise(2) syscall but it has some differences. 1. It needs pidfd of target process to provide the hint 2. It supports only MADV_{COLD\|PAGEOUT\|MERGEABLE\|UNMEREABLE} at this moment. Other hints in madvise will be opened when there are explicit requests from community to prevent unexpected bugs we couldn't support. 3. Only privileged processes can do something for other process's address space. For more detail of the new API, please see "mm: introduce external memory hinting API" description in this patchset. This patch (of 3): In upcoming patches, do_madvise will be called from external process context so we shouldn't asssume "current" is always hinted process's task_struct. Furthermore, we must not access mm_struct via task->mm, but obtain it via access_mm() once (in the following patch) and only use that pointer [1], so pass it to do_madvise() as well. Note the vma->vm_mm pointers are safe, so we can use them further down the call stack. And let's pass current->mm as arguments of do_madvise so it shouldn't change existing behavior but prepare next patch to make review easy. [vbabka@suse.cz: changelog tweak] [minchan@kernel.org: use current->mm for io_uring] Link: http://lkml.kernel.org/r/20200423145215.72666-1-minchan@kernel.org [akpm@linux-foundation.org: fix it for upstream changes] [akpm@linux-foundation.org: whoops] [rdunlap@infradead.org: add missing includes] Signed-off-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jann Horn <jannh@google.com> Cc: Tim Murray <timmurray@google.com> Cc: Daniel Colascione <dancol@google.com> Cc: Sandeep Patil <sspatil@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Brian Geffon <bgeffon@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: John Dias <joaodias@google.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: SeongJae Park <sj38.park@gmail.com> Cc: Christian Brauner <christian@brauner.io> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: SeongJae Park <sjpark@amazon.de> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Florian Weimer <fw@deneb.enyo.de> Cc: <linux-man@vger.kernel.org> Link: https://lkml.kernel.org/r/20200901000633.1920247-1-minchan@kernel.org Link: http://lkml.kernel.org/r/20200622192900.22757-1-minchan@kernel.org Link: http://lkml.kernel.org/r/20200302193630.68771-2-minchan@kernel.org Link: http://lkml.kernel.org/r/20200622192900.22757-2-minchan@kernel.org Link: https://lkml.kernel.org/r/20200901000633.1920247-2-minchan@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-10-18 09:27:09 -07:00
Jens Axboe	91989c7078	task_work: cleanup notification modes A previous commit changed the notification mode from true/false to an int, allowing notify-no, notify-yes, or signal-notify. This was backwards compatible in the sense that any existing true/false user would translate to either 0 (on notification sent) or 1, the latter which mapped to TWA_RESUME. TWA_SIGNAL was assigned a value of 2. Clean this up properly, and define a proper enum for the notification mode. Now we have: - TWA_NONE. This is 0, same as before the original change, meaning no notification requested. - TWA_RESUME. This is 1, same as before the original change, meaning that we use TIF_NOTIFY_RESUME. - TWA_SIGNAL. This uses TIF_SIGPENDING/JOBCTL_TASK_WORK for the notification. Clean up all the callers, switching their 0/1/false/true to using the appropriate TWA_* mode for notifications. Fixes: `e91b481623` ("task_work: teach task_work_add() to do signal_wake_up()") Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-17 15:05:30 -06:00
Pavel Begunkov	58852d4d67	io_uring: fix double poll mask init __io_queue_proc() is used by both, poll reqs and apoll. Don't use req->poll.events to copy poll mask because for apoll it aliases with private data of the request. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-17 09:25:47 -06:00
Jens Axboe	4ea33a976b	io-wq: inherit audit loginuid and sessionid Make sure the async io-wq workers inherit the loginuid and sessionid from the original task, and restore them to unset once we're done with the async work item. While at it, disable the ability for kernel threads to write to their own loginuid. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-17 09:25:47 -06:00
Jens Axboe	d8a6df10aa	io_uring: use percpu counters to track inflight requests Even though we place the req_issued and req_complete in separate cachelines, there's considerable overhead in doing the atomics particularly on the completion side. Get rid of having the two counters, and just use a percpu_counter for this. That's what it was made for, after all. This considerably reduces the overhead in __io_free_req(). Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-17 09:25:47 -06:00
Jens Axboe	500a373d73	io_uring: assign new io_identity for task if members have changed This avoids doing a copy for each new async IO, if some parts of the io_identity has changed. We avoid reference counting for the normal fast path of nothing ever changing. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-17 09:25:46 -06:00
Jens Axboe	5c3462cfd1	io_uring: store io_identity in io_uring_task This is, by definition, a per-task structure. So store it in the task context, instead of doing carrying it in each io_kiocb. We're being a bit inefficient if members have changed, as that requires an alloc and copy of a new io_identity struct. The next patch will fix that up. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-17 09:25:46 -06:00
Jens Axboe	1e6fa5216a	io_uring: COW io_identity on mismatch If the io_identity doesn't completely match the task, then create a copy of it and use that. The existing copy remains valid until the last user of it has gone away. This also changes the personality lookup to be indexed by io_identity, instead of creds directly. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-17 09:25:46 -06:00
Jens Axboe	98447d65b4	io_uring: move io identity items into separate struct io-wq contains a pointer to the identity, which we just hold in io_kiocb for now. This is in preparation for putting this outside io_kiocb. The only exception is struct files_struct, which we'll need different rules for to avoid a circular dependency. No functional changes in this patch. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-17 09:25:45 -06:00
Jens Axboe	dfead8a8e2	io_uring: rely solely on work flags to determine personality. We solely rely on work->work_flags now, so use that for proper checking and clearing/dropping of various identity items. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-17 09:25:45 -06:00
Jens Axboe	0f20376588	io_uring: pass required context in as flags We have a number of bits that decide what context to inherit. Set up io-wq flags for these instead. This is in preparation for always having the various members set, but not always needing them for all requests. No intended functional changes in this patch. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-17 09:25:45 -06:00
Jens Axboe	55cbc2564a	io_uring: fix error path cleanup in io_sqe_files_register() syzbot reports the following crash: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] CPU: 1 PID: 8927 Comm: syz-executor.3 Not tainted 5.9.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:io_file_from_index fs/io_uring.c:5963 [inline] RIP: 0010:io_sqe_files_register fs/io_uring.c:7369 [inline] RIP: 0010:__io_uring_register fs/io_uring.c:9463 [inline] RIP: 0010:__do_sys_io_uring_register+0x2fd2/0x3ee0 fs/io_uring.c:9553 Code: ec 03 49 c1 ee 03 49 01 ec 49 01 ee e8 57 61 9c ff 41 80 3c 24 00 0f 85 9b 09 00 00 4d 8b af b8 01 00 00 4c 89 e8 48 c1 e8 03 <80> 3c 28 00 0f 85 76 09 00 00 49 8b 55 00 89 d8 c1 f8 09 48 98 4c RSP: 0018:ffffc90009137d68 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffc9000ef2a000 RDX: 0000000000040000 RSI: ffffffff81d81dd9 RDI: 0000000000000005 RBP: dffffc0000000000 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffffed1012882a37 R13: 0000000000000000 R14: ffffed1012882a38 R15: ffff888094415000 FS: 00007f4266f3c700(0000) GS:ffff8880ae500000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000000118c000 CR3: 000000008e57d000 CR4: 00000000001506e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x45de59 Code: 0d b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 db b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007f4266f3bc78 EFLAGS: 00000246 ORIG_RAX: 00000000000001ab RAX: ffffffffffffffda RBX: 00000000000083c0 RCX: 000000000045de59 RDX: 0000000020000280 RSI: 0000000000000002 RDI: 0000000000000005 RBP: 000000000118bf68 R08: 0000000000000000 R09: 0000000000000000 R10: 40000000000000a1 R11: 0000000000000246 R12: 000000000118bf2c R13: 00007fff2fa4f12f R14: 00007f4266f3c9c0 R15: 000000000118bf2c Modules linked in: ---[ end trace 2a40a195e2d5e6e6 ]--- RIP: 0010:io_file_from_index fs/io_uring.c:5963 [inline] RIP: 0010:io_sqe_files_register fs/io_uring.c:7369 [inline] RIP: 0010:__io_uring_register fs/io_uring.c:9463 [inline] RIP: 0010:__do_sys_io_uring_register+0x2fd2/0x3ee0 fs/io_uring.c:9553 Code: ec 03 49 c1 ee 03 49 01 ec 49 01 ee e8 57 61 9c ff 41 80 3c 24 00 0f 85 9b 09 00 00 4d 8b af b8 01 00 00 4c 89 e8 48 c1 e8 03 <80> 3c 28 00 0f 85 76 09 00 00 49 8b 55 00 89 d8 c1 f8 09 48 98 4c RSP: 0018:ffffc90009137d68 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffc9000ef2a000 RDX: 0000000000040000 RSI: ffffffff81d81dd9 RDI: 0000000000000005 RBP: dffffc0000000000 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffffed1012882a37 R13: 0000000000000000 R14: ffffed1012882a38 R15: ffff888094415000 FS: 00007f4266f3c700(0000) GS:ffff8880ae400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000000074a918 CR3: 000000008e57d000 CR4: 00000000001506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 which is a copy of fget failure condition jumping to cleanup, but the cleanup requires ctx->file_data to be assigned. Assign it when setup, and ensure that we clear it again for the error path exit. Fixes: `5398ae6985` ("io_uring: clean file_data access in files_register") Reported-by: syzbot+f4ebcc98223dafd8991e@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-17 09:25:44 -06:00
Jens Axboe	0918682be4	Revert "io_uring: mark io_uring_fops/io_op_defs as __read_mostly" This reverts commit `738277adc8`. This change didn't make a lot of sense, and as Linus reports, it actually fails on clang: /tmp/io_uring-dd40c4.s:26476: Warning: ignoring changed section attributes for .data..read_mostly The arrays are already marked const so, by definition, they are not just read-mostly, they are read-only. Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-17 09:25:43 -06:00
Pavel Begunkov	216578e55a	io_uring: fix REQ_F_COMP_LOCKED by killing it REQ_F_COMP_LOCKED is used and implemented in a buggy way. The problem is that the flag is set before io_put_req() but not cleared after, and if that wasn't the final reference, the request will be freed with the flag set from some other context, which may not hold a spinlock. That means possible races with removing linked timeouts and unsynchronised completion (e.g. access to CQ). Instead of fixing REQ_F_COMP_LOCKED, kill the flag and use task_work_add() to move such requests to a fresh context to free from it, as was done with __io_free_req_finish(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-17 09:25:43 -06:00
Pavel Begunkov	4edf20f999	io_uring: dig out COMP_LOCK from deep call chain io_req_clean_work() checks REQ_F_COMP_LOCK to pass this two layers up. Move the check up into __io_free_req(), so at least it doesn't looks so ugly and would facilitate further changes. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-17 09:25:43 -06:00
Pavel Begunkov	6a0af224c2	io_uring: don't put a poll req under spinlock Move io_put_req() in io_poll_task_handler() from under spinlock. This eliminates the need to use REQ_F_COMP_LOCKED, at the expense of potentially having to grab the lock again. That's still a better trade off than relying on the locked flag. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-17 09:25:42 -06:00
Pavel Begunkov	b1b74cfc19	io_uring: don't unnecessarily clear F_LINK_TIMEOUT If a request had REQ_F_LINK_TIMEOUT it would've been cleared in __io_kill_linked_timeout() by the time of __io_fail_links(), so no need to care about it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-17 09:25:42 -06:00
Pavel Begunkov	368c5481ae	io_uring: don't set COMP_LOCKED if won't put __io_kill_linked_timeout() sets REQ_F_COMP_LOCKED for a linked timeout even if it can't cancel it, e.g. it's already running. It not only races with io_link_timeout_fn() for ->flags field, but also leaves the flag set and so io_link_timeout_fn() may find it and decide that it holds the lock. Hopefully, the second problem is potential. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-17 09:25:42 -06:00
Colin Ian King	035fbafc7a	io_uring: Fix sizeof() mismatch An incorrect sizeof() is being used, sizeof(file_data->table) is not correct, it should be sizeof(*file_data->table). Fixes: `5398ae6985` ("io_uring: clean file_data access in files_register") Signed-off-by: Colin Ian King <colin.king@canonical.com> Addresses-Coverity: ("Sizeof not portable (SIZEOF_MISMATCH)") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-17 09:25:41 -06:00
Linus Torvalds	9ff9b0d392	networking changes for the 5.10 merge window Add redirect_neigh() BPF packet redirect helper, allowing to limit stack traversal in common container configs and improving TCP back-pressure. Daniel reports ~10Gbps => ~15Gbps single stream TCP performance gain. Expand netlink policy support and improve policy export to user space. (Ge)netlink core performs request validation according to declared policies. Expand the expressiveness of those policies (min/max length and bitmasks). Allow dumping policies for particular commands. This is used for feature discovery by user space (instead of kernel version parsing or trial and error). Support IGMPv3/MLDv2 multicast listener discovery protocols in bridge. Allow more than 255 IPv4 multicast interfaces. Add support for Type of Service (ToS) reflection in SYN/SYN-ACK packets of TCPv6. In Multi-patch TCP (MPTCP) support concurrent transmission of data on multiple subflows in a load balancing scenario. Enhance advertising addresses via the RM_ADDR/ADD_ADDR options. Support SMC-Dv2 version of SMC, which enables multi-subnet deployments. Allow more calls to same peer in RxRPC. Support two new Controller Area Network (CAN) protocols - CAN-FD and ISO 15765-2:2016. Add xfrm/IPsec compat layer, solving the 32bit user space on 64bit kernel problem. Add TC actions for implementing MPLS L2 VPNs. Improve nexthop code - e.g. handle various corner cases when nexthop objects are removed from groups better, skip unnecessary notifications and make it easier to offload nexthops into HW by converting to a blocking notifier. Support adding and consuming TCP header options by BPF programs, opening the doors for easy experimental and deployment-specific TCP option use. Reorganize TCP congestion control (CC) initialization to simplify life of TCP CC implemented in BPF. Add support for shipping BPF programs with the kernel and loading them early on boot via the User Mode Driver mechanism, hence reusing all the user space infra we have. Support sleepable BPF programs, initially targeting LSM and tracing. Add bpf_d_path() helper for returning full path for given 'struct path'. Make bpf_tail_call compatible with bpf-to-bpf calls. Allow BPF programs to call map_update_elem on sockmaps. Add BPF Type Format (BTF) support for type and enum discovery, as well as support for using BTF within the kernel itself (current use is for pretty printing structures). Support listing and getting information about bpf_links via the bpf syscall. Enhance kernel interfaces around NIC firmware update. Allow specifying overwrite mask to control if settings etc. are reset during update; report expected max time operation may take to users; support firmware activation without machine reboot incl. limits of how much impact reset may have (e.g. dropping link or not). Extend ethtool configuration interface to report IEEE-standard counters, to limit the need for per-vendor logic in user space. Adopt or extend devlink use for debug, monitoring, fw update in many drivers (dsa loop, ice, ionic, sja1105, qed, mlxsw, mv88e6xxx, dpaa2-eth). In mlxsw expose critical and emergency SFP module temperature alarms. Refactor port buffer handling to make the defaults more suitable and support setting these values explicitly via the DCBNL interface. Add XDP support for Intel's igb driver. Support offloading TC flower classification and filtering rules to mscc_ocelot switches. Add PTP support for Marvell Octeontx2 and PP2.2 hardware, as well as fixed interval period pulse generator and one-step timestamping in dpaa-eth. Add support for various auth offloads in WiFi APs, e.g. SAE (WPA3) offload. Add Lynx PHY/PCS MDIO module, and convert various drivers which have this HW to use it. Convert mvpp2 to split PCS. Support Marvell Prestera 98DX3255 24-port switch ASICs, as well as 7-port Mediatek MT7531 IP. Add initial support for QCA6390 and IPQ6018 in ath11k WiFi driver, and wcn3680 support in wcn36xx. Improve performance for packets which don't require much offloads on recent Mellanox NICs by 20% by making multiple packets share a descriptor entry. Move chelsio inline crypto drivers (for TLS and IPsec) from the crypto subtree to drivers/net. Move MDIO drivers out of the phy directory. Clean up a lot of W=1 warnings, reportedly the actively developed subsections of networking drivers should now build W=1 warning free. Make sure drivers don't use in_interrupt() to dynamically adapt their code. Convert tasklets to use new tasklet_setup API (sadly this conversion is not yet complete). Signed-off-by: Jakub Kicinski <kuba@kernel.org> -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAl+ItRwACgkQMUZtbf5S IrtTMg//UxpdR/MirT1DatBU0K/UGAZY82hV7F/UC8tPgjfHZeHvWlDFxfi3YP81 PtPKbhRZ7DhwBXefUp6nY3UdvjftrJK2lJm8prJUPSsZRye8Wlcb7y65q7/P2y2U Efucyopg6RUrmrM0DUsIGYGJgylQLHnMYUl/keCsD4t5Bp4ksyi9R2t5eitGoWzh r3QGdbSa0AuWx4iu0i+tqp6Tj0ekMBMXLVb35dtU1t0joj2KTNEnSgABN3prOa8E iWYf2erOau68Ogp3yU3miCy0ZU4p/7qGHTtzbcp677692P/ekak6+zmfHLT9/Pjy 2Stq2z6GoKuVxdktr91D9pA3jxG4LxSJmr0TImcGnXbvkMP3Ez3g9RrpV5fn8j6F mZCH8TKZAoD5aJrAJAMkhZmLYE1pvDa7KolSk8WogXrbCnTEb5Nv8FHTS1Qnk3yl wSKXuvutFVNLMEHCnWQLtODbTST9DI/aOi6EctPpuOA/ZyL1v3pl+gfp37S+LUTe owMnT/7TdvKaTD0+gIyU53M6rAWTtr5YyRQorX9awIu/4Ha0F0gYD7BJZQUGtegp HzKt59NiSrFdbSH7UdyemdBF4LuCgIhS7rgfeoUXMXmuPHq7eHXyHZt5dzPPa/xP 81P0MAvdpFVwg8ij2yp2sHS7sISIRKq17fd1tIewUabxQbjXqPc= =bc1U -----END PGP SIGNATURE----- Merge tag 'net-next-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: - Add redirect_neigh() BPF packet redirect helper, allowing to limit stack traversal in common container configs and improving TCP back-pressure. Daniel reports ~10Gbps => ~15Gbps single stream TCP performance gain. - Expand netlink policy support and improve policy export to user space. (Ge)netlink core performs request validation according to declared policies. Expand the expressiveness of those policies (min/max length and bitmasks). Allow dumping policies for particular commands. This is used for feature discovery by user space (instead of kernel version parsing or trial and error). - Support IGMPv3/MLDv2 multicast listener discovery protocols in bridge. - Allow more than 255 IPv4 multicast interfaces. - Add support for Type of Service (ToS) reflection in SYN/SYN-ACK packets of TCPv6. - In Multi-patch TCP (MPTCP) support concurrent transmission of data on multiple subflows in a load balancing scenario. Enhance advertising addresses via the RM_ADDR/ADD_ADDR options. - Support SMC-Dv2 version of SMC, which enables multi-subnet deployments. - Allow more calls to same peer in RxRPC. - Support two new Controller Area Network (CAN) protocols - CAN-FD and ISO 15765-2:2016. - Add xfrm/IPsec compat layer, solving the 32bit user space on 64bit kernel problem. - Add TC actions for implementing MPLS L2 VPNs. - Improve nexthop code - e.g. handle various corner cases when nexthop objects are removed from groups better, skip unnecessary notifications and make it easier to offload nexthops into HW by converting to a blocking notifier. - Support adding and consuming TCP header options by BPF programs, opening the doors for easy experimental and deployment-specific TCP option use. - Reorganize TCP congestion control (CC) initialization to simplify life of TCP CC implemented in BPF. - Add support for shipping BPF programs with the kernel and loading them early on boot via the User Mode Driver mechanism, hence reusing all the user space infra we have. - Support sleepable BPF programs, initially targeting LSM and tracing. - Add bpf_d_path() helper for returning full path for given 'struct path'. - Make bpf_tail_call compatible with bpf-to-bpf calls. - Allow BPF programs to call map_update_elem on sockmaps. - Add BPF Type Format (BTF) support for type and enum discovery, as well as support for using BTF within the kernel itself (current use is for pretty printing structures). - Support listing and getting information about bpf_links via the bpf syscall. - Enhance kernel interfaces around NIC firmware update. Allow specifying overwrite mask to control if settings etc. are reset during update; report expected max time operation may take to users; support firmware activation without machine reboot incl. limits of how much impact reset may have (e.g. dropping link or not). - Extend ethtool configuration interface to report IEEE-standard counters, to limit the need for per-vendor logic in user space. - Adopt or extend devlink use for debug, monitoring, fw update in many drivers (dsa loop, ice, ionic, sja1105, qed, mlxsw, mv88e6xxx, dpaa2-eth). - In mlxsw expose critical and emergency SFP module temperature alarms. Refactor port buffer handling to make the defaults more suitable and support setting these values explicitly via the DCBNL interface. - Add XDP support for Intel's igb driver. - Support offloading TC flower classification and filtering rules to mscc_ocelot switches. - Add PTP support for Marvell Octeontx2 and PP2.2 hardware, as well as fixed interval period pulse generator and one-step timestamping in dpaa-eth. - Add support for various auth offloads in WiFi APs, e.g. SAE (WPA3) offload. - Add Lynx PHY/PCS MDIO module, and convert various drivers which have this HW to use it. Convert mvpp2 to split PCS. - Support Marvell Prestera 98DX3255 24-port switch ASICs, as well as 7-port Mediatek MT7531 IP. - Add initial support for QCA6390 and IPQ6018 in ath11k WiFi driver, and wcn3680 support in wcn36xx. - Improve performance for packets which don't require much offloads on recent Mellanox NICs by 20% by making multiple packets share a descriptor entry. - Move chelsio inline crypto drivers (for TLS and IPsec) from the crypto subtree to drivers/net. Move MDIO drivers out of the phy directory. - Clean up a lot of W=1 warnings, reportedly the actively developed subsections of networking drivers should now build W=1 warning free. - Make sure drivers don't use in_interrupt() to dynamically adapt their code. Convert tasklets to use new tasklet_setup API (sadly this conversion is not yet complete). * tag 'net-next-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2583 commits) Revert "bpfilter: Fix build error with CONFIG_BPFILTER_UMH" net, sockmap: Don't call bpf_prog_put() on NULL pointer bpf, selftest: Fix flaky tcp_hdr_options test when adding addr to lo bpf, sockmap: Add locking annotations to iterator netfilter: nftables: allow re-computing sctp CRC-32C in 'payload' statements net: fix pos incrementment in ipv6_route_seq_next net/smc: fix invalid return code in smcd_new_buf_create() net/smc: fix valid DMBE buffer sizes net/smc: fix use-after-free of delayed events bpfilter: Fix build error with CONFIG_BPFILTER_UMH cxgb4/ch_ipsec: Replace the module name to ch_ipsec from chcr net: sched: Fix suspicious RCU usage while accessing tcf_tunnel_info bpf: Fix register equivalence tracking. rxrpc: Fix loss of final ack on shutdown rxrpc: Fix bundle counting for exclusive connections netfilter: restore NF_INET_NUMHOOKS ibmveth: Identify ingress large send packets. ibmveth: Switch order of ibmveth_helper calls. cxgb4: handle 4-tuple PEDIT to NAT mode translation selftests: Add VRF route leaking tests ...	2020-10-15 18:42:13 -07:00
Linus Torvalds	6ad4bf6ea1	io_uring-5.10-2020-10-12 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl+EXPEQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpiR4EAC3trm1ojXVF7y9/XRhcPpb4Pror+ZA1coO gyoy+zUuCEl9WCzzHWqXULMYMP0YzNJnJs0oLQPA1s0sx1H4uDMl/UXg0OXZisYG Y59Kca3c1DHFwj9KPQXfGmCEjc/rbDWK5TqRc2iZMz+6E5Mt71UFZHtenwgV1zD8 hTmZZkzLCu2ePfOvrPONgL5tDqPWGVyn61phoC7cSzMF66juXGYuvQGktzi/m6q+ jAxUnhKvKTlLB9wsq3s5X/20/QD56Yuba9U+YxeeNDBE8MDWQOsjz0mZCV1fn4p3 h/6762aRaWaXH7EwMtsHFUWy7arJZg/YoFYNYLv4Ksyy3y4sMABZCy3A+JyzrgQ0 hMu7vjsP+k22X1WH8nyejBfWNEmxu6dpgckKrgF0dhJcXk/acWA3XaDWZ80UwfQy isKRAP1rA0MJKHDMIwCzSQJDPvtUAkPptbNZJcUSU78o+pPoCaQ93V++LbdgGtKn iGJJqX05dVbcsDx5X7fluphjkUTC4yFr7ZgLgbOIedXajWRD8iOkO2xxCHk6SKFl iv9entvRcX9k3SHK9uffIUkRBUujMU0+HCIQFCO1qGmkCaS5nSrovZl4HoL7L/Dj +T8+v7kyJeklLXgJBaE7jk01O4HwZKjwPWMbCjvL9NKk8j7c1soYnRu5uNvi85Mu /9wn671s+w== =udgj -----END PGP SIGNATURE----- Merge tag 'io_uring-5.10-2020-10-12' of git://git.kernel.dk/linux-block Pull io_uring updates from Jens Axboe: - Add blkcg accounting for io-wq offload (Dennis) - A use-after-free fix for io-wq (Hillf) - Cancelation fixes and improvements - Use proper files_struct references for offload - Cleanup of io_uring_get_socket() since that can now go into our own header - SQPOLL fixes and cleanups, and support for sharing the thread - Improvement to how page accounting is done for registered buffers and huge pages, accounting the real pinned state - Series cleaning up the xarray code (Willy) - Various cleanups, refactoring, and improvements (Pavel) - Use raw spinlock for io-wq (Sebastian) - Add support for ring restrictions (Stefano) * tag 'io_uring-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (62 commits) io_uring: keep a pointer ref_node in file_data io_uring: refactor *files_register()'s error paths io_uring: clean file_data access in files_register io_uring: don't delay io_init_req() error check io_uring: clean leftovers after splitting issue io_uring: remove timeout.list after hrtimer cancel io_uring: use a separate struct for timeout_remove io_uring: improve submit_state.ios_left accounting io_uring: simplify io_file_get() io_uring: kill extra check in fixed io_file_get() io_uring: clean up ->files grabbing io_uring: don't io_prep_async_work() linked reqs io_uring: Convert advanced XArray uses to the normal API io_uring: Fix XArray usage in io_uring_add_task_file io_uring: Fix use of XArray in __io_uring_files_cancel io_uring: fix break condition for __io_uring_register() waiting io_uring: no need to call xa_destroy() on empty xarray io_uring: batch account ->req_issue and task struct references io_uring: kill callback_head argument for io_req_task_work_add() io_uring: move req preps out of io_issue_sqe() ...	2020-10-13 12:36:21 -07:00
Linus Torvalds	85ed13e78d	Merge branch 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull compat iovec cleanups from Al Viro: "Christoph's series around import_iovec() and compat variant thereof" * 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: security/keys: remove compat_keyctl_instantiate_key_iov mm: remove compat_process_vm_{readv,writev} fs: remove compat_sys_vmsplice fs: remove the compat readv/writev syscalls fs: remove various compat readv/writev helpers iov_iter: transparently handle compat iovecs in import_iovec iov_iter: refactor rw_copy_check_uvector and import_iovec iov_iter: move rw_copy_check_uvector() into lib/iov_iter.c compat.h: fix a spelling error in <linux/compat.h>	2020-10-12 16:35:51 -07:00
Pavel Begunkov	b2e9685283	io_uring: keep a pointer ref_node in file_data ->cur_refs of struct fixed_file_data always points to percpu_ref embedded into struct fixed_file_ref_node. Don't overuse container_of() and offsetting, and point directly to fixed_file_ref_node. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-10 12:49:25 -06:00
Pavel Begunkov	600cf3f8b3	io_uring: refactor *files_register()'s error paths Don't keep repeating cleaning sequences in error paths, write it once in the and use labels. It's less error prone and looks cleaner. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-10 12:49:25 -06:00
Pavel Begunkov	5398ae6985	io_uring: clean file_data access in files_register Keep file_data in a local var and replace with it complex references such as ctx->file_data. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-10 12:49:25 -06:00
Pavel Begunkov	692d836351	io_uring: don't delay io_init_req() error check Don't postpone io_init_req() error checks and do that right after calling it. There is no control-flow statements or dependencies with sqe/submitted accounting, so do those earlier, that makes the code flow a bit more natural. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-10 12:49:25 -06:00
Pavel Begunkov	062d04d731	io_uring: clean leftovers after splitting issue Kill extra if in io_issue_sqe() and place send/recv[msg] calls appropriately under switch's cases. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-10 12:49:25 -06:00
Pavel Begunkov	a71976f3fa	io_uring: remove timeout.list after hrtimer cancel Remove timeouts from ctx->timeout_list after hrtimer_try_to_cancel() successfully cancels it. With this we don't need to care whether there was a race and it was removed in io_timeout_fn(), and that will be handy for following patches. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-10 12:49:25 -06:00
Pavel Begunkov	0bdf7a2ddb	io_uring: use a separate struct for timeout_remove Don't use struct io_timeout for both IORING_OP_TIMEOUT and IORING_OP_TIMEOUT_REMOVE, they're quite different. Split them in two, that allows to remove an unused field in struct io_timeout, and btw kill ->flags not used by either. This also easier to follow, especially for timeout remove. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-10 12:49:25 -06:00
Pavel Begunkov	71b547c048	io_uring: improve submit_state.ios_left accounting state->ios_left isn't decremented for requests that don't need a file, so it might be larger than number of SQEs left. That in some circumstances makes us to grab more files that is needed so imposing extra put. Deaccount one ios_left for each request. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-10 12:49:25 -06:00
Pavel Begunkov	8371adf53c	io_uring: simplify io_file_get() Keep ->needs_file_no_error check out of io_file_get(), and let callers handle it. It makes it more straightforward. Also, as the only error it can hand back -EBADF, make it return a file or NULL. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-10 12:49:24 -06:00
Pavel Begunkov	479f517be5	io_uring: kill extra check in fixed io_file_get() ctx->nr_user_files == 0 IFF ctx->file_data == NULL and there fixed files are not used. Hence, verifying fds only against ctx->nr_user_files is enough. Remove the other check from hot path. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-10 12:49:24 -06:00
Pavel Begunkov	233295130e	io_uring: clean up ->files grabbing Move work.files grabbing into io_prep_async_work() to all other work resources initialisation. We don't need to keep it separately now, as ->ring_fd/file are gone. It also allows to not grab it when a request is not going to io-wq. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-10 12:49:24 -06:00
Pavel Begunkov	5bf5e464f1	io_uring: don't io_prep_async_work() linked reqs There is no real reason left for preparing io-wq work context for linked requests in advance, remove it as this might become a bottleneck in some cases. Reported-by: Roman Gershman <romger@amazon.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-10 12:49:20 -06:00
Matthew Wilcox (Oracle)	5e2ed8c4f4	io_uring: Convert advanced XArray uses to the normal API There are no bugs here that I've spotted, it's just easier to use the normal API and there are no performance advantages to using the more verbose advanced API. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-09 09:00:05 -06:00
Matthew Wilcox (Oracle)	236434c343	io_uring: Fix XArray usage in io_uring_add_task_file The xas_store() wasn't paired with an xas_nomem() loop, so if it couldn't allocate memory using GFP_NOWAIT, it would leak the reference to the file descriptor. Also the node pointed to by the xas could be freed between the call to xas_load() under the rcu_read_lock() and the acquisition of the xa_lock. It's easier to just use the normal xa_load/xa_store interface here. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> [axboe: fix missing assign after alloc, cur_uring -> tctx rename] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-09 08:59:40 -06:00
Matthew Wilcox (Oracle)	ce765372bc	io_uring: Fix use of XArray in __io_uring_files_cancel We have to drop the lock during each iteration, so there's no advantage to using the advanced API. Convert this to a standard xa_for_each() loop. Reported-by: syzbot+27c12725d8ff0bfe1a13@syzkaller.appspotmail.com Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-09 08:52:26 -06:00
Jens Axboe	ed6930c920	io_uring: fix break condition for __io_uring_register() waiting Colin reports that there's unreachable code, since we only ever break if ret == 0. This is correct, and is due to a reversed logic condition in when to break or not. Break out of the loop if we don't process any task work, in that case we do want to return -EINTR. Fixes: `af9c1a44f8` ("io_uring: process task work in io_uring_register()") Reported-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-08 20:37:45 -06:00
Jens Axboe	ca6484cd30	io_uring: no need to call xa_destroy() on empty xarray The kernel test robot reports this lockdep issue: [child1:659] mbind (274) returned ENOSYS, marking as inactive. [child1:659] mq_timedsend (279) returned ENOSYS, marking as inactive. [main] 10175 iterations. [F:7781 S:2344 HI:2397] [ 24.610601] [ 24.610743] ================================ [ 24.611083] WARNING: inconsistent lock state [ 24.611437] 5.9.0-rc7-00017-g0f2122045b9462 #5 Not tainted [ 24.611861] -------------------------------- [ 24.612193] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage. [ 24.612660] ksoftirqd/0/7 [HC0[0]:SC1[3]:HE0:SE0] takes: [ 24.613086] f00ed998 (&xa->xa_lock#4){+.?.}-{2:2}, at: xa_destroy+0x43/0xc1 [ 24.613642] {SOFTIRQ-ON-W} state was registered at: [ 24.614024] lock_acquire+0x20c/0x29b [ 24.614341] _raw_spin_lock+0x21/0x30 [ 24.614636] io_uring_add_task_file+0xe8/0x13a [ 24.614987] io_uring_create+0x535/0x6bd [ 24.615297] io_uring_setup+0x11d/0x136 [ 24.615606] __ia32_sys_io_uring_setup+0xd/0xf [ 24.615977] do_int80_syscall_32+0x53/0x6c [ 24.616306] restore_all_switch_stack+0x0/0xb1 [ 24.616677] irq event stamp: 939881 [ 24.616968] hardirqs last enabled at (939880): [<8105592d>] __local_bh_enable_ip+0x13c/0x145 [ 24.617642] hardirqs last disabled at (939881): [<81b6ace3>] _raw_spin_lock_irqsave+0x1b/0x4e [ 24.618321] softirqs last enabled at (939738): [<81b6c7c8>] __do_softirq+0x3f0/0x45a [ 24.618924] softirqs last disabled at (939743): [<81055741>] run_ksoftirqd+0x35/0x61 [ 24.619521] [ 24.619521] other info that might help us debug this: [ 24.620028] Possible unsafe locking scenario: [ 24.620028] [ 24.620492] CPU0 [ 24.620685] ---- [ 24.620894] lock(&xa->xa_lock#4); [ 24.621168] <Interrupt> [ 24.621381] lock(&xa->xa_lock#4); [ 24.621695] [ 24.621695] * DEADLOCK * [ 24.621695] [ 24.622154] 1 lock held by ksoftirqd/0/7: [ 24.622468] #0: 823bfb94 (rcu_callback){....}-{0:0}, at: rcu_process_callbacks+0xc0/0x155 [ 24.623106] [ 24.623106] stack backtrace: [ 24.623454] CPU: 0 PID: 7 Comm: ksoftirqd/0 Not tainted 5.9.0-rc7-00017-g0f2122045b9462 #5 [ 24.624090] Call Trace: [ 24.624284] ? show_stack+0x40/0x46 [ 24.624551] dump_stack+0x1b/0x1d [ 24.624809] print_usage_bug+0x17a/0x185 [ 24.625142] mark_lock+0x11d/0x1db [ 24.625474] ? print_shortest_lock_dependencies+0x121/0x121 [ 24.625905] __lock_acquire+0x41e/0x7bf [ 24.626206] lock_acquire+0x20c/0x29b [ 24.626517] ? xa_destroy+0x43/0xc1 [ 24.626810] ? lock_acquire+0x20c/0x29b [ 24.627110] _raw_spin_lock_irqsave+0x3e/0x4e [ 24.627450] ? xa_destroy+0x43/0xc1 [ 24.627725] xa_destroy+0x43/0xc1 [ 24.627989] __io_uring_free+0x57/0x71 [ 24.628286] ? get_pid+0x22/0x22 [ 24.628544] __put_task_struct+0xf2/0x163 [ 24.628865] put_task_struct+0x1f/0x2a [ 24.629161] delayed_put_task_struct+0xe2/0xe9 [ 24.629509] rcu_process_callbacks+0x128/0x155 [ 24.629860] __do_softirq+0x1a3/0x45a [ 24.630151] run_ksoftirqd+0x35/0x61 [ 24.630443] smpboot_thread_fn+0x304/0x31a [ 24.630763] kthread+0x124/0x139 [ 24.631016] ? sort_range+0x18/0x18 [ 24.631290] ? kthread_create_worker_on_cpu+0x17/0x17 [ 24.631682] ret_from_fork+0x1c/0x28 which is complaining about xa_destroy() grabbing the xa lock in an IRQ disabling fashion, whereas the io_uring uses cases aren't interrupt safe. This is really an xarray issue, since it should not assume the lock type. But for our use case, since we know the xarray is empty at this point, there's no need to actually call xa_destroy(). So just get rid of it. Fixes: `0f2122045b` ("io_uring: don't rely on weak ->files references") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-08 07:46:52 -06:00
Jens Axboe	faf7b51c06	io_uring: batch account ->req_issue and task struct references Identical to how we handle the ctx reference counts, increase by the batch we're expecting to submit, and handle any slow path residual, if any. The request alloc-and-issue path is very hot, and this makes a noticeable difference by avoiding an two atomic incs for each individual request. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-07 12:55:42 -06:00
David S. Miller	8b0308fe31	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Rejecting non-native endian BTF overlapped with the addition of support for it. The rest were more simple overlapping changes, except the renesas ravb binding update, which had to follow a file move as well as a YAML conversion. Signed-off-by: David S. Miller <davem@davemloft.net>	2020-10-05 18:40:01 -07:00
Christoph Hellwig	89cd35c58b	iov_iter: transparently handle compat iovecs in import_iovec Use in compat_syscall to import either native or the compat iovecs, and remove the now superflous compat_import_iovec. This removes the need for special compat logic in most callers, and the remaining ones can still be simplified by using __import_iovec with a bool compat parameter. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-10-03 00:02:13 -04:00
Linus Torvalds	702bfc891d	io_uring-5.9-2020-10-02 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl93Z48QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpmp4EACwxi4UVnL0zhaOBmXfqxDuaXViwkfVZNxx d40y+DcCewnpZMk2G9cES8OKG+Tu2GFX2yl1m2XdrIWJ6jpnGFKJOkNQGfPDQrT3 fI7qFrEDeSVeLUMMBxtvZLW8w2D0KcNCgla4h/ESXI9xtPTZdYXhYQY0zfuWalUC ZplUgAWlHx82qJari7ZmIfeVtpAoujTvkccRe+/RtPv5vO+UsvP7kqPSCYMGqhHS 7z5gK3Nw+PNMWrzZVZ6Rw5nLeExx9PJGgiEkitEjn7mRJELXV9eWnTt9D0eVwaec WO7OSQmrJLmMFER4ZhkDNJkXZFvlYUCygnwJQmH70LflRqUEA00O6wX4J32O3NIg fIDWKMGGANFU5atL+RHqfQgUYq0GY1UsIvZxJnwRwv1QssmJoQq9fpT6VYqiQMik 2JAeWyMqTGI4vRNmVJKTR/13SpRUYrvS3wHN53kCaBBhE5Y/vFksgOGgXZBG/TPk odpegeJOTa5xuS0YcKIK6yL/xHENct1Y1BtVjczrXKJz0E90n5ZdIR0lEg6Ij3B1 jZUwKiS2sY09eBaJIQvtD4hIaw5VgqtwinKTyt7MBw/6pCqJpSZtaV0Uvgvjq/Se 1ifUo4cWwQBccZLgWeWoEalio2fNIyb+J+sm7eu9Xygjl67U2M8oMfAN2JjkM7As btLazer4lg== =fo3Z -----END PGP SIGNATURE----- Merge tag 'io_uring-5.9-2020-10-02' of git://git.kernel.dk/linux-block Pull io_uring fixes from Jens Axboe: - fix for async buffered reads if read-ahead is fully disabled (Hao) - double poll match fix - ->show_fdinfo() potential ABBA deadlock complaint fix * tag 'io_uring-5.9-2020-10-02' of git://git.kernel.dk/linux-block: io_uring: fix async buffered reads when readahead is disabled io_uring: fix potential ABBA deadlock in ->show_fdinfo() io_uring: always delete double poll wait entry on match	2020-10-02 14:38:10 -07:00
Jens Axboe	87c4311fd2	io_uring: kill callback_head argument for io_req_task_work_add() We always use &req->task_work anyway, no point in passing it in. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 21:00:16 -06:00
Pavel Begunkov	c1379e247a	io_uring: move req preps out of io_issue_sqe() All request preparations are done only during submission, reflect it in the code by moving io_req_prep() much earlier into io_queue_sqe(). That's much cleaner, because it doen't expose bits to async code which it won't ever use. Also it makes the interface harder to misuse, and there are potential places for bugs. For instance, __io_queue() doesn't clear @sqe before proceeding to a next linked request, that could have been disastrous, but hopefully there are linked requests IFF sqe==NULL, so not actually a bug. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:38:46 -06:00
Pavel Begunkov	bfe7655983	io_uring: decouple issuing and req preparation io_issue_sqe() does two things at once, trying to prepare request and issuing them. Split it in two and deduplicate with io_defer_prep(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:38:46 -06:00
Pavel Begunkov	73debe68b3	io_uring: remove nonblock arg from io_{rw}_prep() All io_*_prep() functions including io_{read,write}_prep() are called only during submission where @force_nonblock is always true. Don't keep propagating it and instead remove the @force_nonblock argument from prep() altogether. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:38:46 -06:00
Pavel Begunkov	a88fc40021	io_uring: set/clear IOCB_NOWAIT into io_read/write Move setting IOCB_NOWAIT from io_prep_rw() into io_read()/io_write(), so it's set/cleared in a single place. Also remove @force_nonblock parameter from io_prep_rw(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:38:46 -06:00
Pavel Begunkov	2d199895d2	io_uring: remove F_NEED_CLEANUP check in prep() REQ_F_NEED_CLEANUP is set only by io__prep() and they're guaranteed to be called only once, so there is no one who may have set the flag before. Kill REQ_F_NEED_CLEANUP check in these *prep() handlers. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:38:46 -06:00
Pavel Begunkov	5b09e37e27	io_uring: io_kiocb_ppos() style change Put brackets around bitwise ops in a complex expression Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:38:45 -06:00
Pavel Begunkov	291b2821e0	io_uring: simplify io_alloc_req() Extract common code from if/else branches. That is cleaner and optimised even better. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:38:45 -06:00
Joseph Qi	dbbe9c6424	io_uring: show sqthread pid and cpu in fdinfo In most cases we'll specify IORING_SETUP_SQPOLL and run multiple io_uring instances in a host. Since all sqthreads are named "io_uring-sq", it's hard to distinguish the relations between application process and its io_uring sqthread. With this patch, application can get its corresponding sqthread pid and cpu through show_fdinfo. Steps: 1. Get io_uring fd first. $ ls -l /proc/<pid>/fd \| grep -w io_uring 2. Then get io_uring instance related info, including corresponding sqthread pid and cpu. $ cat /proc/<pid>/fdinfo/<io_uring_fd> pos: 0 flags: 02000002 mnt_id: 13 SqThread: 6929 SqThreadCpu: 2 UserFiles: 1 0: testfile UserBufs: 0 PollList: Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> [axboe: fixed for new shared SQPOLL] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:34 -06:00
Jens Axboe	af9c1a44f8	io_uring: process task work in io_uring_register() We do this for CQ ring wait, in case task_work completions come in. We should do the same in io_uring_register(), to avoid spurious -EINTR if the ring quiescing ends up having to process task_work to complete the operation Reported-by: Dan Melnic <dmm@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:34 -06:00
Dennis Zhou	91d8f5191e	io_uring: add blkcg accounting to offloaded operations There are a few operations that are offloaded to the worker threads. In this case, we lose process context and end up in kthread context. This results in ios to be not accounted to the issuing cgroup and consequently end up as issued by root. Just like others, adopt the personality of the blkcg too when issuing via the workqueues. For the SQPOLL thread, it will live and attach in the inited cgroup's context. Signed-off-by: Dennis Zhou <dennis@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:34 -06:00
Jens Axboe	de2939388b	io_uring: improve registered buffer accounting for huge pages io_uring does account any registered buffer as pinned/locked memory, and checks limit and fails if the given user doesn't have a big enough limit to register the ranges specified. However, if huge pages are used, we are potentially under-accounting the memory in terms of what gets pinned on the vm side. This patch rectifies that, by ensuring that we account the full size of a compound page, regardless of how much of it is being registered. Huge pages are not accounted mulitple times - if multiple sections of a huge page is registered, then the page is only accounted once. Reported-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:34 -06:00
Zheng Bin	14db84110d	io_uring: remove unneeded semicolon Fixes coccicheck warning: fs/io_uring.c:4242:13-14: Unneeded semicolon Signed-off-by: Zheng Bin <zhengbin13@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:34 -06:00
Jens Axboe	e95eee2dee	io_uring: cap SQ submit size for SQPOLL with multiple rings In the spirit of fairness, cap the max number of SQ entries we'll submit for SQPOLL if we have multiple rings. If we don't do that, we could be submitting tons of entries for one ring, while others are waiting to get service. The value of 8 is somewhat arbitrarily chosen as something that allows a fair bit of batching, without using an excessive time per ring. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:34 -06:00
Jens Axboe	e8c2bc1fb6	io_uring: get rid of req->io/io_async_ctx union There's really no point in having this union, it just means that we're always allocating enough room to cater to any command. But that's pointless, as the ->io field is request type private anyway. This gets rid of the io_async_ctx structure, and fills in the required size in the io_op_defs[] instead. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:34 -06:00
Pavel Begunkov	4be1c61512	io_uring: kill extra user_bufs check Testing ctx->user_bufs for NULL in io_import_fixed() is not neccessary, because in that case ctx->nr_user_bufs would be zero, and the following check would fail. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:34 -06:00
Pavel Begunkov	ab0b196ce5	io_uring: fix overlapped memcpy in io_req_map_rw() When io_req_map_rw() is called from io_rw_prep_async(), it memcpy() iorw->iter into itself. Even though it doesn't lead to an error, such a memcpy()'s aliasing rules violation is considered to be a bad practise. Inline io_req_map_rw() into io_rw_prep_async(). We don't really need any remapping there, so it's much simpler than the generic implementation. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:33 -06:00
Pavel Begunkov	afb87658f8	io_uring: refactor io_req_map_rw() Set rw->free_iovec to @iovec, that gives an identical result and stresses that @iovec param rw->free_iovec play the same role. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:33 -06:00
Pavel Begunkov	f4bff104ff	io_uring: simplify io_rw_prep_async() Don't touch iter->iov and iov in between __io_import_iovec() and io_req_map_rw(), the former function aleady sets it correctly, because it creates one more case with NULL'ed iov to consider in io_req_map_rw(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:33 -06:00
Jens Axboe	9055420072	io_uring: provide IORING_ENTER_SQ_WAIT for SQPOLL SQ ring waits When using SQPOLL, applications can run into the issue of running out of SQ ring entries because the thread hasn't consumed them yet. The only option for dealing with that is checking later, or busy checking for the condition. Provide IORING_ENTER_SQ_WAIT if applications want to wait on this condition. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:33 -06:00
Jens Axboe	738277adc8	io_uring: mark io_uring_fops/io_op_defs as __read_mostly These structures are never written, move them appropriately. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:33 -06:00
Jens Axboe	aa06165de8	io_uring: enable IORING_SETUP_ATTACH_WQ to attach to SQPOLL thread too We support using IORING_SETUP_ATTACH_WQ to share async backends between rings created by the same process, this now also allows the same to happen with SQPOLL. The setup procedure remains the same, the caller sets io_uring_params->wq_fd to the 'parent' context, and then the newly created ring will attach to that async backend. This means that multiple rings can share the same SQPOLL thread, saving resources. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:33 -06:00
Jens Axboe	69fb21310f	io_uring: base SQPOLL handling off io_sq_data Remove the SQPOLL thread from the ctx, and use the io_sq_data as the data structure we pass in. io_sq_data has a list of ctx's that we can then iterate over and handle. As of now we're ready to handle multiple ctx's, though we're still just handling a single one after this patch. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:33 -06:00
Jens Axboe	534ca6d684	io_uring: split SQPOLL data into separate structure Move all the necessary state out of io_ring_ctx, and into a new structure, io_sq_data. The latter now deals with any state or variables associated with the SQPOLL thread itself. In preparation for supporting more than one io_ring_ctx per SQPOLL thread. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:33 -06:00
Jens Axboe	c8d1ba583f	io_uring: split work handling part of SQPOLL into helper This is done in preparation for handling more than one ctx, but it also cleans up the code a bit since io_sq_thread() was a bit too unwieldy to get a get overview on. __io_sq_thread() is now the main handler, and it returns an enum sq_ret that tells io_sq_thread() what it ended up doing. The parent then makes a decision on idle, spinning, or work handling based on that. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:33 -06:00
Jens Axboe	3f0e64d054	io_uring: move SQPOLL post-wakeup ring need wakeup flag into wake handler We need to decouple the clearing on wakeup from the the inline schedule, as that is going to be required for handling multiple rings in one thread. Wrap our wakeup handler so we can clear it when we get the wakeup, by definition that is when we no longer need the flag set. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:33 -06:00
Jens Axboe	6a7793828f	io_uring: use private ctx wait queue entries for SQPOLL This is in preparation to sharing the poller thread between rings. For that we need per-ring wait_queue_entry storage, and we can't easily put that on the stack if one thread is managing multiple rings. We'll also be sharing the wait_queue_head across rings for the purposes of wakeups, provide the usual private ring wait_queue_head for now but make it a pointer so we can easily override it when sharing. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:33 -06:00
Jens Axboe	e35afcf912	io_uring: io_sq_thread() doesn't need to flush signals We're not handling signals by default in kernel threads, and we never use TWA_SIGNAL for the SQPOLL thread internally. Hence we can never have a signal pending, and we don't need to check for it (nor flush it). Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:33 -06:00
Stefano Garzarella	7e84e1c756	io_uring: allow disabling rings during the creation This patch adds a new IORING_SETUP_R_DISABLED flag to start the rings disabled, allowing the user to register restrictions, buffers, files, before to start processing SQEs. When IORING_SETUP_R_DISABLED is set, SQE are not processed and SQPOLL kthread is not started. The restrictions registration are allowed only when the rings are disable to prevent concurrency issue while processing SQEs. The rings can be enabled using IORING_REGISTER_ENABLE_RINGS opcode with io_uring_register(2). Suggested-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:33 -06:00
Stefano Garzarella	21b55dbc06	io_uring: add IOURING_REGISTER_RESTRICTIONS opcode The new io_uring_register(2) IOURING_REGISTER_RESTRICTIONS opcode permanently installs a feature allowlist on an io_ring_ctx. The io_ring_ctx can then be passed to untrusted code with the knowledge that only operations present in the allowlist can be executed. The allowlist approach ensures that new features added to io_uring do not accidentally become available when an existing application is launched on a newer kernel version. Currently is it possible to restrict sqe opcodes, sqe flags, and register opcodes. IOURING_REGISTER_RESTRICTIONS can only be made once. Afterwards it is not possible to change restrictions anymore. This prevents untrusted code from removing restrictions. Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:33 -06:00
Jens Axboe	9b82849215	io_uring: reference ->nsproxy for file table commands If we don't get and assign the namespace for the async work, then certain paths just don't work properly (like /dev/stdin, /proc/mounts, etc). Anything that references the current namespace of the given task should be assigned for async work on behalf of that task. Cc: stable@vger.kernel.org # v5.5+ Reported-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:32 -06:00
Jens Axboe	0f2122045b	io_uring: don't rely on weak ->files references Grab actual references to the files_struct. To avoid circular references issues due to this, we add a per-task note that keeps track of what io_uring contexts a task has used. When the tasks execs or exits its assigned files, we cancel requests based on this tracking. With that, we can grab proper references to the files table, and no longer need to rely on stashing away ring_fd and ring_file to check if the ring_fd may have been closed. Cc: stable@vger.kernel.org # v5.5+ Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:32 -06:00
Jens Axboe	e6c8aa9ac3	io_uring: enable task/files specific overflow flushing This allows us to selectively flush out pending overflows, depending on the task and/or files_struct being passed in. No intended functional changes in this patch. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:32 -06:00
Jens Axboe	76e1b6427f	io_uring: return cancelation status from poll/timeout/files handlers Return whether we found and canceled requests or not. This is in preparation for using this information, no functional changes in this patch. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:32 -06:00
Jens Axboe	e3bc8e9dad	io_uring: unconditionally grab req->task Sometimes we assign a weak reference to it, sometimes we grab a reference to it. Clean this up and make it unconditional, and drop the flag related to tracking this state. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:32 -06:00
Jens Axboe	2aede0e417	io_uring: stash ctx task reference for SQPOLL We can grab a reference to the task instead of stashing away the task files_struct. This is doable without creating a circular reference between the ring fd and the task itself. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:32 -06:00
Jens Axboe	f573d38445	io_uring: move dropping of files into separate helper No functional changes in this patch, prep patch for grabbing references to the files_struct. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:32 -06:00
Jens Axboe	f3606e3a92	io_uring: allow timeout/poll/files killing to take task into account We currently cancel these when the ring exits, and we cancel all of them. This is in preparation for killing only the ones associated with a given task. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-30 20:32:32 -06:00
Jens Axboe	0f07889691	Merge branch 'io_uring-5.9' into for-5.10/io_uring * io_uring-5.9: io_uring: fix async buffered reads when readahead is disabled io_uring: fix potential ABBA deadlock in ->show_fdinfo() io_uring: always delete double poll wait entry on match	2020-09-30 20:32:25 -06:00
Hao Xu	c8d317aa18	io_uring: fix async buffered reads when readahead is disabled The async buffered reads feature is not working when readahead is turned off. There are two things to concern: - when doing retry in io_read, not only the IOCB_WAITQ flag but also the IOCB_NOWAIT flag is still set, which makes it goes to would_block phase in generic_file_buffered_read() and then return -EAGAIN. After that, the io-wq thread work is queued, and later doing the async reads in the old way. - even if we remove IOCB_NOWAIT when doing retry, the feature is still not running properly, since in generic_file_buffered_read() it goes to lock_page_killable() after calling mapping->a_ops->readpage() to do IO, and thus causing process to sleep. Fixes: `1a0a7853b9` ("mm: support async buffered reads in generic_file_buffered_read()") Fixes: `3b2a4439e0` ("io_uring: get rid of kiocb_wait_page_queue_init()") Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-29 07:54:00 -06:00
Jens Axboe	fad8e0de44	io_uring: fix potential ABBA deadlock in ->show_fdinfo() syzbot reports a potential lock deadlock between the normal IO path and ->show_fdinfo(): ====================================================== WARNING: possible circular locking dependency detected 5.9.0-rc6-syzkaller #0 Not tainted ------------------------------------------------------ syz-executor.2/19710 is trying to acquire lock: ffff888098ddc450 (sb_writers#4){.+.+}-{0:0}, at: io_write+0x6b5/0xb30 fs/io_uring.c:3296 but task is already holding lock: ffff8880a11b8428 (&ctx->uring_lock){+.+.}-{3:3}, at: __do_sys_io_uring_enter+0xe9a/0x1bd0 fs/io_uring.c:8348 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (&ctx->uring_lock){+.+.}-{3:3}: __mutex_lock_common kernel/locking/mutex.c:956 [inline] __mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103 __io_uring_show_fdinfo fs/io_uring.c:8417 [inline] io_uring_show_fdinfo+0x194/0xc70 fs/io_uring.c:8460 seq_show+0x4a8/0x700 fs/proc/fd.c:65 seq_read+0x432/0x1070 fs/seq_file.c:208 do_loop_readv_writev fs/read_write.c:734 [inline] do_loop_readv_writev fs/read_write.c:721 [inline] do_iter_read+0x48e/0x6e0 fs/read_write.c:955 vfs_readv+0xe5/0x150 fs/read_write.c:1073 kernel_readv fs/splice.c:355 [inline] default_file_splice_read.constprop.0+0x4e6/0x9e0 fs/splice.c:412 do_splice_to+0x137/0x170 fs/splice.c:871 splice_direct_to_actor+0x307/0x980 fs/splice.c:950 do_splice_direct+0x1b3/0x280 fs/splice.c:1059 do_sendfile+0x55f/0xd40 fs/read_write.c:1540 __do_sys_sendfile64 fs/read_write.c:1601 [inline] __se_sys_sendfile64 fs/read_write.c:1587 [inline] __x64_sys_sendfile64+0x1cc/0x210 fs/read_write.c:1587 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #1 (&p->lock){+.+.}-{3:3}: __mutex_lock_common kernel/locking/mutex.c:956 [inline] __mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103 seq_read+0x61/0x1070 fs/seq_file.c:155 pde_read fs/proc/inode.c:306 [inline] proc_reg_read+0x221/0x300 fs/proc/inode.c:318 do_loop_readv_writev fs/read_write.c:734 [inline] do_loop_readv_writev fs/read_write.c:721 [inline] do_iter_read+0x48e/0x6e0 fs/read_write.c:955 vfs_readv+0xe5/0x150 fs/read_write.c:1073 kernel_readv fs/splice.c:355 [inline] default_file_splice_read.constprop.0+0x4e6/0x9e0 fs/splice.c:412 do_splice_to+0x137/0x170 fs/splice.c:871 splice_direct_to_actor+0x307/0x980 fs/splice.c:950 do_splice_direct+0x1b3/0x280 fs/splice.c:1059 do_sendfile+0x55f/0xd40 fs/read_write.c:1540 __do_sys_sendfile64 fs/read_write.c:1601 [inline] __se_sys_sendfile64 fs/read_write.c:1587 [inline] __x64_sys_sendfile64+0x1cc/0x210 fs/read_write.c:1587 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #0 (sb_writers#4){.+.+}-{0:0}: check_prev_add kernel/locking/lockdep.c:2496 [inline] check_prevs_add kernel/locking/lockdep.c:2601 [inline] validate_chain kernel/locking/lockdep.c:3218 [inline] __lock_acquire+0x2a96/0x5780 kernel/locking/lockdep.c:4441 lock_acquire+0x1f3/0xaf0 kernel/locking/lockdep.c:5029 percpu_down_read include/linux/percpu-rwsem.h:51 [inline] __sb_start_write+0x228/0x450 fs/super.c:1672 io_write+0x6b5/0xb30 fs/io_uring.c:3296 io_issue_sqe+0x18f/0x5c50 fs/io_uring.c:5719 __io_queue_sqe+0x280/0x1160 fs/io_uring.c:6175 io_queue_sqe+0x692/0xfa0 fs/io_uring.c:6254 io_submit_sqe fs/io_uring.c:6324 [inline] io_submit_sqes+0x1761/0x2400 fs/io_uring.c:6521 __do_sys_io_uring_enter+0xeac/0x1bd0 fs/io_uring.c:8349 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 other info that might help us debug this: Chain exists of: sb_writers#4 --> &p->lock --> &ctx->uring_lock Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&ctx->uring_lock); lock(&p->lock); lock(&ctx->uring_lock); lock(sb_writers#4); * DEADLOCK * 1 lock held by syz-executor.2/19710: #0: ffff8880a11b8428 (&ctx->uring_lock){+.+.}-{3:3}, at: __do_sys_io_uring_enter+0xe9a/0x1bd0 fs/io_uring.c:8348 stack backtrace: CPU: 0 PID: 19710 Comm: syz-executor.2 Not tainted 5.9.0-rc6-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x198/0x1fd lib/dump_stack.c:118 check_noncircular+0x324/0x3e0 kernel/locking/lockdep.c:1827 check_prev_add kernel/locking/lockdep.c:2496 [inline] check_prevs_add kernel/locking/lockdep.c:2601 [inline] validate_chain kernel/locking/lockdep.c:3218 [inline] __lock_acquire+0x2a96/0x5780 kernel/locking/lockdep.c:4441 lock_acquire+0x1f3/0xaf0 kernel/locking/lockdep.c:5029 percpu_down_read include/linux/percpu-rwsem.h:51 [inline] __sb_start_write+0x228/0x450 fs/super.c:1672 io_write+0x6b5/0xb30 fs/io_uring.c:3296 io_issue_sqe+0x18f/0x5c50 fs/io_uring.c:5719 __io_queue_sqe+0x280/0x1160 fs/io_uring.c:6175 io_queue_sqe+0x692/0xfa0 fs/io_uring.c:6254 io_submit_sqe fs/io_uring.c:6324 [inline] io_submit_sqes+0x1761/0x2400 fs/io_uring.c:6521 __do_sys_io_uring_enter+0xeac/0x1bd0 fs/io_uring.c:8349 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x45e179 Code: 3d b2 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 0b b2 fb ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007f1194e74c78 EFLAGS: 00000246 ORIG_RAX: 00000000000001aa RAX: ffffffffffffffda RBX: 00000000000082c0 RCX: 000000000045e179 RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000004 RBP: 000000000118cf98 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 000000000118cf4c R13: 00007ffd1aa5756f R14: 00007f1194e759c0 R15: 000000000118cf4c Fix this by just not diving into details if we fail to trylock the io_uring mutex. We know the ctx isn't going away during this operation, but we cannot safely iterate buffers/files/personalities if we don't hold the io_uring mutex. Reported-by: syzbot+2f8fa4e860edc3066aba@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-28 09:06:08 -06:00
Jens Axboe	8706e04ed7	io_uring: always delete double poll wait entry on match syzbot reports a crash with tty polling, which is using the double poll handling: general protection fault, probably for non-canonical address 0xdffffc0000000009: 0000 [#1] PREEMPT SMP KASAN KASAN: null-ptr-deref in range [0x0000000000000048-0x000000000000004f] CPU: 0 PID: 6874 Comm: syz-executor749 Not tainted 5.9.0-rc6-next-20200924-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:io_poll_get_single fs/io_uring.c:4778 [inline] RIP: 0010:io_poll_double_wake+0x51/0x510 fs/io_uring.c:4845 Code: fc ff df 48 c1 ea 03 80 3c 02 00 0f 85 9e 03 00 00 48 b8 00 00 00 00 00 fc ff df 49 8b 5d 08 48 8d 7b 48 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84 c0 74 06 0f 8e 63 03 00 00 0f b6 6b 48 bf 06 00 00 RSP: 0018:ffffc90001c1fb70 EFLAGS: 00010006 RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000004 RDX: 0000000000000009 RSI: ffffffff81d9b3ad RDI: 0000000000000048 RBP: dffffc0000000000 R08: ffff8880a3cac798 R09: ffffc90001c1fc60 R10: fffff52000383f73 R11: 0000000000000000 R12: 0000000000000004 R13: ffff8880a3cac798 R14: ffff8880a3cac7a0 R15: 0000000000000004 FS: 0000000001f98880(0000) GS:ffff8880ae400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f18886916c0 CR3: 0000000094c5a000 CR4: 00000000001506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: __wake_up_common+0x147/0x650 kernel/sched/wait.c:93 __wake_up_common_lock+0xd0/0x130 kernel/sched/wait.c:123 tty_ldisc_hangup+0x1cf/0x680 drivers/tty/tty_ldisc.c:735 __tty_hangup.part.0+0x403/0x870 drivers/tty/tty_io.c:625 __tty_hangup drivers/tty/tty_io.c:575 [inline] tty_vhangup+0x1d/0x30 drivers/tty/tty_io.c:698 pty_close+0x3f5/0x550 drivers/tty/pty.c:79 tty_release+0x455/0xf60 drivers/tty/tty_io.c:1679 __fput+0x285/0x920 fs/file_table.c:281 task_work_run+0xdd/0x190 kernel/task_work.c:141 tracehook_notify_resume include/linux/tracehook.h:188 [inline] exit_to_user_mode_loop kernel/entry/common.c:165 [inline] exit_to_user_mode_prepare+0x1e2/0x1f0 kernel/entry/common.c:192 syscall_exit_to_user_mode+0x7a/0x2c0 kernel/entry/common.c:267 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x401210 which is due to a failure in removing the double poll wait entry if we hit a wakeup match. This can cause multiple invocations of the wakeup, which isn't safe. Cc: stable@vger.kernel.org # v5.8 Reported-by: syzbot+81b3883093f772addf6d@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-28 08:38:54 -06:00
Linus Torvalds	692495baa3	io_uring-5.9-2020-09-25 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl9upV4QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpuPrD/9K1UQLv38K2nPYclLymOi+GIsukpjgwzdY SM38GNXU5vYkFhylH/bXfckNQ/gja0/whNpcr/UVCgTWleMnss9UiZaCgysyuIOL vnBxT4yDZIxtkOwF/790NiwV2FrsmrLFdNZU4LkmfbmmrAlNtjOElKyJsOyNNMzJ UMzHH2Z1vvUwKz+Yq4fPyZCJbpNHN6ABwkSXY/Nz8oWsKfw728fZztLsP57gOtkl yYVFO2z1n7VaWp5ZzVFYG51DFuMCIDXgN6mMlaKfnQ6auQZxjR+jg69HSRKLjIx7 ZEG1gl/DzwH1+751P7HnuI3U7BtBYolyErHW4j4a6Ko4XX8PPhG9ODKOmsEMPrEq gCUGcGgWUsEyz+pyullTEt7ea/oLGJ5N86qtNOdviXETZZTghm47QlzxFWr1/GWy wH++ctBf/Ekk0dbCBF6mJiqDl/PrVSDSClTVhsGJESEmk4BOoC9zd9zT/EfsiR9m vA8wLE2g1/5oU+irQ0Dlc/EENVWISiigOFFvTPJJjma9iGXAW3kV2/aYW6DKZSwM w/va7zTlzt89O+L0AT+Rg8btaiTiaZcs3op1AFa1z5Gut3b5YhWL/e95wlaOI6Nv Tudm4GX06BaN1QdUDV9g0Pr5iNOaCvuOArjNOU3j7ySusJxiJ8GdA3WFqJ/XUlIV pne8hC/+7A== =mBw7 -----END PGP SIGNATURE----- Merge tag 'io_uring-5.9-2020-09-25' of git://git.kernel.dk/linux-block Pull io_uring fixes from Jens Axboe: "Two fixes for regressions in this cycle, and one that goes to 5.8 stable: - fix leak of getname() retrieved filename - remove plug->nowait assignment, fixing a regression with btrfs - fix for async buffered retry" * tag 'io_uring-5.9-2020-09-25' of git://git.kernel.dk/linux-block: io_uring: ensure async buffered read-retry is setup properly io_uring: don't unconditionally set plug->nowait = true io_uring: ensure open/openat2 name is cleaned on cancelation	2020-09-26 11:13:51 -07:00
Jens Axboe	f38c7e3abf	io_uring: ensure async buffered read-retry is setup properly A previous commit for fixing up short reads botched the async retry path, so we ended up going to worker threads more often than we should. Fix this up, so retries work the way they originally were intended to. Fixes: `227c0c9673` ("io_uring: internally retry short reads") Reported-by: Hao_Xu <haoxu@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-25 15:39:13 -06:00
Jens Axboe	62c774ed48	io_uring: don't unconditionally set plug->nowait = true This causes all the bios to be submitted with REQ_NOWAIT, which can be problematic on either btrfs or on file systems that otherwise use a mix of block devices where only some of them support it. For now, just remove the setting of plug->nowait = true. Reported-by: Dan Melnic <dmm@fb.com> Reported-by: Brian Foster <bfoster@redhat.com> Fixes: `b63534c41e` ("io_uring: re-issue block requests that failed because of resources") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-25 09:01:53 -06:00
Jens Axboe	f3cd485050	io_uring: ensure open/openat2 name is cleaned on cancelation If we cancel these requests, we'll leak the memory associated with the filename. Add them to the table of ops that need cleaning, if REQ_F_NEED_CLEANUP is set. Cc: stable@vger.kernel.org Fixes: `e62753e4e2` ("io_uring: call statx directly") Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-25 07:41:46 -06:00
David S. Miller	3ab0a7a0c3	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Two minor conflicts: 1) net/ipv4/route.c, adding a new local variable while moving another local variable and removing it's initial assignment. 2) drivers/net/dsa/microchip/ksz9477.c, overlapping changes. One pretty prints the port mode differently, whilst another changes the driver to try and obtain the port mode from the port node rather than the switch node. Signed-off-by: David S. Miller <davem@davemloft.net>	2020-09-22 16:45:34 -07:00
Linus Torvalds	0baca07006	io_uring-5.9-2020-09-22 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl9qLpQQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpk/qD/0dj9STzEMkUsbl2XA5oifF2NVn6VHMidJ3 Ukdhoy4ihh2UFBFO2VZv2UNZ7o4Zt53TA3ha+fB0EL7I23g86XTOItTWd+JHOGpI M11JejYTxcSUzPVrPfd/2PJ/Tqx+ld4ojTxH8noS4hx7FgueSuRR80UU5gfLGAmr e7A7vHD8tr9ZoqNcyVVCYa0/80gUbxh1wYOMvqaE6dSPITe96keGKmmk8hRA8kQo SBfbZeEqf2oErlM0dTVOd34rZbQQyRuMpDmLuc/g6RNMFVPyBqEvQmGwqOtWNe4q RFS9/imQA1Wi1OD15NoDx0C7BGovmT53xfXpnqI3lXzywxSDGhGVQd0E8Udp6zha xszrFlQEqS4OFZrHK6B+tnJBFFBZ8jN0K3ZlHpO8QH83OGvyr2k/RokoHFWMTSYh +5pHRd+6p7o8traQ6h0MJXmacIxZ0hQdJPuawRjAnziBgRhMV2FMLAXgYHtWl0AD wUiBWUEIV9PP0phu78X2TxvB9L7CPjuv7orJ8Q5dBSkQc7i33ESYMe8Mix85CFm+ SQcazoQE7VLL175TN/FdDDKkBeyAsob9TjeEazb04Vywy0vHW+MGrSOescCBDLF7 RRDRE0E12Ur9BTVTBi/MJsXT2xtufxN2YU368ZX78RYwgI4r9lx4LZZDte3h9/gs xEPXk5vuzg== =ImBG -----END PGP SIGNATURE----- Merge tag 'io_uring-5.9-2020-09-22' of git://git.kernel.dk/linux-block Pull io_uring fixes from Jens Axboe: "A few fixes - most of them regression fixes from this cycle, but also a few stable heading fixes, and a build fix for the included demo tool since some systems now actually have gettid() available" * tag 'io_uring-5.9-2020-09-22' of git://git.kernel.dk/linux-block: io_uring: fix openat/openat2 unified prep handling io_uring: mark statx/files_update/epoll_ctl as non-SQPOLL tools/io_uring: fix compile breakage io_uring: don't use retry based buffered reads for non-async bdev io_uring: don't re-setup vecs/iter in io_resumit_prep() is already there io_uring: don't run task work on an exiting task io_uring: drop 'ctx' ref on task work cancelation io_uring: grab any needed state during defer prep	2020-09-22 14:36:50 -07:00
Jens Axboe	4eb8dded6b	io_uring: fix openat/openat2 unified prep handling A previous commit unified how we handle prep for these two functions, but this means that we check the allowed context (SQPOLL, specifically) later than we should. Move the ring type checking into the two parent functions, instead of doing it after we've done some setup work. Fixes: `ec65fea5a8` ("io_uring: deduplicate io_openat{,2}_prep()") Reported-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-21 07:51:03 -06:00
Jens Axboe	6ca56f8459	io_uring: mark statx/files_update/epoll_ctl as non-SQPOLL These will naturally fail when attempted through SQPOLL, but either with -EFAULT or -EBADF. Make it explicit that these are not workable through SQPOLL and return -EINVAL, just like other ops that need to use ->files. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-21 07:51:00 -06:00
Jens Axboe	f5cac8b156	io_uring: don't use retry based buffered reads for non-async bdev Some block devices, like dm, bubble back -EAGAIN through the completion handler. We check for this in io_read(), but don't honor it for when we have copied the iov. Return -EAGAIN for this case before retrying, to force punt to io-wq. Fixes: `bcf5a06304` ("io_uring: support true async buffered reads, if file provides it") Reported-by: Zorro Lang <zlang@redhat.com> Tested-by: Zorro Lang <zlang@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-21 07:50:56 -06:00
Jens Axboe	8f3d749685	io_uring: don't re-setup vecs/iter in io_resumit_prep() is already there If we already have mapped the necessary data for retry, then don't set it up again. It's a pointless operation, and we leak the iovec if it's a large (non-stack) vec. Fixes: `b63534c41e` ("io_uring: re-issue block requests that failed because of resources") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-21 07:50:54 -06:00
Jens Axboe	6200b0ae4e	io_uring: don't run task work on an exiting task This isn't safe, and isn't needed either. We are guaranteed that any work we queue is on a live task (and will be run), or it goes to our backup io-wq threads if the task is exiting. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-09-14 10:22:15 -06:00

1 2 3 4 5 ...

973 commits