Commit graph

71819 commits

Author SHA1 Message Date
Pavel Begunkov
21c843d582 io_uring: move req_ref_get() and friends
Move all request refcount helpers to avoid forward declarations in the
future.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/89fd36f6f3fe5b733dfe4546c24725eee40df605.1628705069.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:10:32 -06:00
Jens Axboe
79ebeaee8a io_uring: remove IRQ aspect of io_ring_ctx completion lock
We have no hard/soft IRQ users of this lock left, remove any IRQ
disabling/saving and restoring when grabbing this lock.

This is straight forward with no users entering with IRQs disabled
anymore, the only thing to look out for is the waitqueue poll head
lock which nests inside the completion lock. That needs IRQs disabled,
and hence we have to do that now instead of relying on the outer lock
doing so.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:10:32 -06:00
Jens Axboe
8ef12efe26 io_uring: run regular file completions from task_work
This is in preparation to making the completion lock work outside of
hard/soft IRQ context.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:10:32 -06:00
Jens Axboe
89b263f6d5 io_uring: run linked timeouts from task_work
This is in preparation to making the completion lock work outside of
hard/soft IRQ context.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:10:32 -06:00
Jens Axboe
89850fce16 io_uring: run timeouts from task_work
This is in preparation to making the completion lock work outside of
hard/soft IRQ context.

Add a timeout_lock to handle the ordering of timeout completions or
cancelations with the timeouts actually triggering.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:10:32 -06:00
Pavel Begunkov
62906e89e6 io_uring: remove file batch-get optimisation
For requests with non-fixed files, instead of grabbing just one
reference, we get by the number of left requests, so the following
requests using the same file can take it without atomics.

However, it's not all win. If there is one request in the middle
not using files or having a fixed file, we'll need to put back the left
references. Even worse if an application submits requests dealing with
different files, it will do a put for each new request, so doubling the
number of atomics needed. Also, even if not used, it's still takes some
cycles in the submission path.

If a file used many times, it rather makes sense to pre-register it, if
not, we may fall in the described pitfall. So, this optimisation is a
matter of use case. Go with the simpliest code-wise way, remove it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:10:32 -06:00
Pavel Begunkov
6294f3686b io_uring: clean up tctx_task_work()
After recent fixes, tctx_task_work() always does proper spinlocking
before looking into ->task_list, so now we don't need atomics for
->task_state, replace it with non-atomic task_running using the critical
section.

Tide it up, combine two separate block with spinlocking, and always try
to splice in there, so we do less locking when new requests are arriving
during the function execution.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
[axboe: fix missing ->task_running reset on task_work_add() failure]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:10:32 -06:00
Pavel Begunkov
5d70904367 io_uring: inline io_poll_remove_waitqs
Inline io_poll_remove_waitqs() into its only user and clean it up.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2f1a91a19ffcd591531dc4c61e2f11c64a2d6a6d.1628536684.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:10:26 -06:00
Pavel Begunkov
90f67366cb io_uring: remove extra argument for overflow flush
Unlike __io_cqring_overflow_flush(), nobody does forced flushing with
io_cqring_overflow_flush(), so removed the argument from it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7594f869ca41b7cfb5a35a3c7c2d402242834e9e.1628536684.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:10:19 -06:00
Pavel Begunkov
cd0ca2e048 io_uring: inline struct io_comp_state
Inline struct io_comp_state into struct io_submit_state. They are
already coupled tightly, together with mixed responsibilities it
only brings confusion having them separately.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e55bba77426b399e3a2e54e3c6c267c6a0fc4b57.1628536684.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:09:48 -06:00
Pavel Begunkov
bb943b8265 io_uring: use inflight_entry instead of compl.list
req->compl.list is used to cache freed requests, and so can't overlap in
time with req->inflight_entry. So, use inflight_entry to link requests
and remove compl.list.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e430e79d22d70a190d718831bda7bfed1daf8976.1628536684.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:09:43 -06:00
Pavel Begunkov
7255834ed6 io_uring: remove redundant args from cache_free
We don't use @tsk argument of io_req_cache_free(), remove it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6a28b4a58ee0aaf0db98e2179b9c9f06f9b0cca1.1628536684.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:09:43 -06:00
Pavel Begunkov
c34b025f2d io_uring: cache __io_free_req()'d requests
Don't kfree requests in __io_free_req() but put them back into the
internal request cache. That makes allocations more sustainable and will
be used for refcounting optimisations.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9f4950fbe7771c8d41799366d0a3a08ac3040236.1628536684.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:09:43 -06:00
Pavel Begunkov
f56165e62f io_uring: move io_fallback_req_func()
Move io_fallback_req_func() to kill yet another forward declaration.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d0a8f9d9a0057ed761d6237167d51c9378798d2d.1628536684.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:09:22 -06:00
Pavel Begunkov
e9dbe221f5 io_uring: optimise putting task struct
We cache all the reference to task + tctx, so if io_put_task() is
called by the corresponding task itself, we can save on atomics and
return the refs right back into the cache.

It's beneficial for all inline completions, and also iopolling, when
polling and submissions are done by the same task, including
SQPOLL|IOPOLL.

Note: io_uring_cancel_generic() can return refs to the cache as well,
so those should be flushed in the loop for tctx_inflight() to work
right.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6fe9646b3cb70e46aca1f58426776e368c8926b3.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:08:06 -06:00
Pavel Begunkov
af066f31eb io_uring: drop exec checks from io_req_task_submit
In case of on-exec io_uring cancellations, tasks already wait for all
submitted requests to get completed/cancelled, so we don't need to check
for ->in_execve separately.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/be8707049f10df9d20ca03dc4ca3316239b5e8e0.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:08:06 -06:00
Pavel Begunkov
bbbca09489 io_uring: kill unused IO_IOPOLL_BATCH
IO_IOPOLL_BATCH is not used, delete it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b2bdf19dbee2c9fc8865bbab9412135a14e24a64.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:08:06 -06:00
Pavel Begunkov
58d3be2c60 io_uring: improve ctx hang handling
If io_ring_exit_work() can't get it done in 5 minutes, something is
going very wrong, don't keep spinning at HZ / 20 rate, it doesn't help
and it may take much of CPU time if there is a lot of workers stuck as
such.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9e2d1ca81d569f6bc628af1a42ff6663bff7ce9c.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:07:59 -06:00
Pavel Begunkov
d3fddf6ddd io_uring: deduplicate open iopoll check
Move IORING_SETUP_IOPOLL check into __io_openat_prep(), so both openat
and openat2 reuse it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9a73ce83e4ee60d011180ef177eecef8e87ff2a2.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:07:59 -06:00
Pavel Begunkov
543af3a13d io_uring: inline io_free_req_deferred
Inline io_free_req_deferred(), there is no reason to keep it separated.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ce04b7180d4eac0d69dd00677b227eefe80c2cc5.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:07:59 -06:00
Pavel Begunkov
b9bd2bea0f io_uring: move io_rsrc_node_alloc() definition
Move the function together with io_rsrc_node_ref_zero() in the source
file as it is to get rid of forward declarations.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4d81f6f833e7d017860b24463a9a68b14a8a5ed2.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:07:59 -06:00
Pavel Begunkov
6a290a1442 io_uring: move io_put_task() definition
Move the function in the source file as it is to get rid of forward
declarations.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/33d917d69e4206557c75a5b98fe22bcdf77ce47d.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:07:59 -06:00
Pavel Begunkov
e73c5c7cd3 io_uring: extract a helper for ctx quiesce
Refactor __io_uring_register() by extracting a helper responsible for
ctx queisce. Looks better and will make it easier to add more
optimisations.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0339e0027504176be09237eefa7945bf9a6f153d.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:07:59 -06:00
Pavel Begunkov
90291099f2 io_uring: optimise io_cqring_wait() hot path
Turns out we always init struct io_wait_queue in io_cqring_wait(), even
if it's not used after, i.e. there are already enough of CQEs. And often
it's exactly what happens, for instance, requests may have been
completed inline, or in case of io_uring_enter(submit=N, wait=1).

It shows up in my profiler, so optimise it by delaying the struct init.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6f1b81c60b947d165583dc333947869c3d85d037.1628471125.git.asml.silence@gmail.com
[axboe: fixed up for new cqring wait]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:07:59 -06:00
Pavel Begunkov
282cdc8693 io_uring: add more locking annotations for submit
Add more annotations for submission path functions holding ->uring_lock.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/128ec4185e26fbd661dd3a424aa66108ee8ff951.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:07:59 -06:00
Pavel Begunkov
a2416e1ec2 io_uring: don't halt iopoll too early
IOPOLL users should care more about getting completions for requests
they submitted, but not in "device did/completed something". Currently,
io_do_iopoll() may return a positive number, which will instruct
io_iopoll_check() to break the loop and end the syscall, even if there
is not enough CQEs or none at all.

Don't return positive numbers, so io_iopoll_check() exits only when it
gets an actual error, need reschedule or got enough CQEs.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/641a88f751623b6758303b3171f0a4141f06726e.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:07:59 -06:00
Pavel Begunkov
864ea921b0 io_uring: refactor io_alloc_req
Replace the main if of io_flush_cached_reqs() with inverted condition +
goto, so all the cases are handled in the same way. And also extract
io_preinit_req() to make it cleaner and easier to refer to.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1abcba1f7b55dc53bf1dbe95036e345ffb1d5b01.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:07:59 -06:00
Pavel Begunkov
8724dd8c83 io-wq: improve wq_list_add_tail()
Prepare nodes that we're going to add before actually linking them, it's
always safer and costs us nothing.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f7e53f0c84c02ed6748c488ed0789b98f8cc6185.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:07:56 -06:00
Pavel Begunkov
2215bed924 io_uring: remove unnecessary PF_EXITING check
We prefer nornal task_works even if it would fail requests inside. Kill
a PF_EXITING check in io_req_task_work_add(), task_work_add() handles
well dying tasks, i.e. return error when can't enqueue due to late
stages of do_exit().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/fc14297e8441cd8f5d1743a2488cf0df09bf48ac.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:07:56 -06:00
Pavel Begunkov
ebc11b6c6b io_uring: clean io-wq callbacks
Move io-wq callbacks closer to each other, so it's easier to work with
them, and rename io_free_work() into io_wq_free_work() for consistency.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/851bbc7f0f86f206d8c1333efee8bcb9c26e419f.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:07:56 -06:00
Pavel Begunkov
c97d8a0f68 io_uring: avoid touching inode in rw prep
If we use fixed files, we can be sure (almost) that REQ_F_ISREG is set.
However, for non-reg files io_prep_rw() still will look into inode to
double check, and that's expensive and can be avoided.

The only caveat is that it only currently works with 64+ bit
architectures, see FFS_ISREG, so we should consider that.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0a62780c491ca2522cd52db4ae3f16e03aafed0f.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:07:56 -06:00
Pavel Begunkov
b191e2dfe5 io_uring: rename io_file_supports_async()
io_file_supports_async() checks whether a file supports nowait
operations, so "async" in the name is misleading. Rename it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/33d55b5ce43aa1884c637c1957f1e30d30dc3bec.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:07:56 -06:00
Pavel Begunkov
ac177053bb io_uring: inline fixed part of io_file_get()
Optimise io_file_get() with registered files, which is in a hot path,
by inlining parts of the function. Saves a function call, and
inefficiencies of passing arguments, e.g. evaluating
(sqe_flags & IOSQE_FIXED_FILE).

It couldn't have been done before as compilers were refusing to inline
it because of the function size.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/52115cd6ce28f33bd0923149c0e6cb611084a0b1.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:07:56 -06:00
Pavel Begunkov
042b0d85ea io_uring: use kvmalloc for fixed files
Instead of hand-coded two-level tables for registered files, allocate
them with kvmalloc(). In many cases small enough tables are enough, and
so can be kmalloc()'ed removing an extra memory load and a bunch of bit
logic instructions from the hot path. If the table is larger, we trade
off all the pros with a TLB-assisted memory lookup.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/280421d3b48775dabab773006bb5588c7b2dabc0.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:07:56 -06:00
Jens Axboe
5fd4617840 io_uring: be smarter about waking multiple CQ ring waiters
Currently we only wake the first waiter, even if we have enough entries
posted to satisfy multiple waiters. Improve that situation so that
every waiter knows how much the CQ tail has to advance before they can
be safely woken up.

With this change, if we have N waiters each asking for 1 event and we get
4 completions, then we wake up 4 waiters. If we have N waiters asking
for 2 completions and we get 4 completions, then we wake up the first
two. Previously, only the first waiter would've been woken up.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:07:56 -06:00
Jens Axboe
d3e9f732c4 io-wq: remove GFP_ATOMIC allocation off schedule out path
Daniel reports that the v5.14-rc4-rt4 kernel throws a BUG when running
stress-ng:

| [   90.202543] BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:35
| [   90.202549] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 2047, name: iou-wrk-2041
| [   90.202555] CPU: 5 PID: 2047 Comm: iou-wrk-2041 Tainted: G        W         5.14.0-rc4-rt4+ #89
| [   90.202559] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
| [   90.202561] Call Trace:
| [   90.202577]  dump_stack_lvl+0x34/0x44
| [   90.202584]  ___might_sleep.cold+0x87/0x94
| [   90.202588]  rt_spin_lock+0x19/0x70
| [   90.202593]  ___slab_alloc+0xcb/0x7d0
| [   90.202598]  ? newidle_balance.constprop.0+0xf5/0x3b0
| [   90.202603]  ? dequeue_entity+0xc3/0x290
| [   90.202605]  ? io_wqe_dec_running.isra.0+0x98/0xe0
| [   90.202610]  ? pick_next_task_fair+0xb9/0x330
| [   90.202612]  ? __schedule+0x670/0x1410
| [   90.202615]  ? io_wqe_dec_running.isra.0+0x98/0xe0
| [   90.202618]  kmem_cache_alloc_trace+0x79/0x1f0
| [   90.202621]  io_wqe_dec_running.isra.0+0x98/0xe0
| [   90.202625]  io_wq_worker_sleeping+0x37/0x50
| [   90.202628]  schedule+0x30/0xd0
| [   90.202630]  schedule_timeout+0x8f/0x1a0
| [   90.202634]  ? __bpf_trace_tick_stop+0x10/0x10
| [   90.202637]  io_wqe_worker+0xfd/0x320
| [   90.202641]  ? finish_task_switch.isra.0+0xd3/0x290
| [   90.202644]  ? io_worker_handle_work+0x670/0x670
| [   90.202646]  ? io_worker_handle_work+0x670/0x670
| [   90.202649]  ret_from_fork+0x22/0x30

which is due to the RT kernel not liking a GFP_ATOMIC allocation inside
a raw spinlock. Besides that not working on RT, doing any kind of
allocation from inside schedule() is kind of nasty and should be avoided
if at all possible.

This particular path happens when an io-wq worker goes to sleep, and we
need a new worker to handle pending work. We currently allocate a small
data item to hold the information we need to create a new worker, but we
can instead include this data in the io_worker struct itself and just
protect it with a single bit lock. We only really need one per worker
anyway, as we will have run pending work between to sleep cycles.

https://lore.kernel.org/lkml/20210804082418.fbibprcwtzyt5qax@beryllium.lan/
Reported-by: Daniel Wagner <dwagner@suse.de>
Tested-by: Daniel Wagner <dwagner@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:07:56 -06:00
Linus Torvalds
15517c724c File locking change for v5.14
-----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCAAxFiEES8DXskRxsqGE6vXTAA5oQRlWghUFAmEg5JwTHGpsYXl0b25A
 a2VybmVsLm9yZwAKCRAADmhBGVaCFfX3D/94AsPqwbnIyLgP2KjoltTKtnSsOZba
 tzTKwrA7vPB9b0aFg0i42omwRtGQk5jPaB8J+tBx+oc/JjO+Dpj5FOXOI9PjFr//
 2zXBuMdvYWaPlsl9nw5dFzhybdPnv+HgNXA8PRqy/DifU7/oLWxHzQ4sGvr1XXNc
 vow5KzLe1K4R+hQBwqKLh/9VVE4+foVTAwXWcFBi/RrDf5X97BF7s98JYsNxfpbn
 EYSyLr818TCUlivAUzx+y0JD+qUknZLuNYWZ2HkYvI29y5h1gUCVmn1orlCDdDq/
 G3SCF8M66l4xGW+bpzrEabqIDl+0IWhX+grU+UVdiMH4YuXmY4HPdNgKYIpJolyy
 zdg8cCO3OAHTJiwkj5jSfxFHQnzPr58LSQGDfrIrNjcDKlQbx3c+8R5yMy7Ar7jZ
 XAM6h88PAuBknTDWBQogZtuSqKbV8D2LsAUVgRKA7iKSwYXXUmZdW+UDiOavsHmg
 n2fbi1sPP7woQKyXzFddwxG3+2Nzby8BE7xuyHTXdOrQJNE1PvICH+WVo2m8+FCe
 uLKLXNf4UECO2MaSyd6k90v3AVty4u1EDTbitgcERztWGzazZdTVlmRdpwy35V+B
 wskKALzjGgqFoSwkEAnXQdk4+Gk9EtnrXD+HYfmxGYKk87fOBCgJtqDWYw3RHK7n
 giBfvji9HK3R8A==
 =xZLO
 -----END PGP SIGNATURE-----

Merge tag 'locks-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux

Pull mandatory file locking deprecation warning from Jeff Layton:
 "As discussed on the list, this patch just adds a new warning for folks
  who still have mandatory locking enabled and actually mount with '-o
  mand'. I'd like to get this in for v5.14 so we can push this out into
  stable kernels and hopefully reach folks who have mounts with -o mand.

  For now, I'm operating under the assumption that we'll fully remove
  this support in v5.15, but we can move that out if any legitimate
  users of this facility speak up between now and then"

* tag 'locks-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
  fs: warn about impending deprecation of mandatory locks
2021-08-21 10:50:22 -07:00
Linus Torvalds
1e6907d58c io_uring-5.14-2021-08-20
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmEgaZ8QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpmFCEACzSqGGsFj7kxxxX45n8r9+3cnfSN2HRnBe
 rcKEEypK7ZyohfIfCQ+Qxhi6NCkYYtV3Enr7ErKUAbe8XpO0Eza84h0bjEBozSUa
 ZxyDVOVPHF4r0P6pGmYKNNijlPImx7ORL8kAfGTDLi0f3Qc1Chmkej1K9g5SBjIz
 qxzvv+I6L2OIokZwjmw/zCOD9GMvFtfPbL4xG1MEdXxH+aXtitkcnGlh7ZWxfFnz
 Mw4/QiyBOsya2clhvqT0FckED63wy5h0es/R0WMleIYQbSeORF2YZGejd7LoaeQj
 5PMvTeqv1f85lBo9BuUxp23/KDdwy0MOaKktOMFGWOkAvRH0ezjebNtUr6yBtXnt
 UyD/Wpdhpo0x6ZcpqB7UnwADKgd41GjPfQKqH/zNZta1vhfldg0mzuN6wfQpumg6
 j+fp/7uY/lwo5fpOF4CLwVrQEENi4Yir0aA4M6s53UknmrO26H0qHRM/uiHhmRVr
 gG5s5ZwJgabepb9VP7fCQ7NLMpfiijTPFEUahBgB5QTtVomUhsWY/XyODoGkF0PS
 frKe6+0IakmkX+fzNV3vTzchew8ySesQacU6y1ugl0cPqL9si1uqzYdv6+bOgry4
 NKlsRAqeYQAivQSdxdAE96+zIWnwJqQ6GKyvb/9XgPACGSWkzm7s5yXdXWKLwoBH
 FSPqTXs6Tw==
 =/PEi
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.14-2021-08-20' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:
 "A few small fixes that should go into this release:

   - Fix never re-assigning an initial error value for io_uring_enter()
     for SQPOLL, if asked to do nothing

   - Fix xa_alloc_cycle() return value checking, for cases where we have
     wrapped around

   - Fix for a ctx pin issue introduced in this cycle (Pavel)"

* tag 'io_uring-5.14-2021-08-20' of git://git.kernel.dk/linux-block:
  io_uring: fix xa_alloc_cycle() error return value check
  io_uring: pin ctx on fallback execution
  io_uring: only assign io_uring_enter() SQPOLL error in actual error case
2021-08-21 08:06:26 -07:00
Jeff Layton
fdd92b64d1 fs: warn about impending deprecation of mandatory locks
We've had CONFIG_MANDATORY_FILE_LOCKING since 2015 and a lot of distros
have disabled it. Warn the stragglers that still use "-o mand" that
we'll be dropping support for that mount option.

Cc: stable@vger.kernel.org
Signed-off-by: Jeff Layton <jlayton@kernel.org>
2021-08-21 07:32:45 -04:00
Jens Axboe
a30f895ad3 io_uring: fix xa_alloc_cycle() error return value check
We currently check for ret != 0 to indicate error, but '1' is a valid
return and just indicates that the allocation succeeded with a wrap.
Correct the check to be for < 0, like it was before the xarray
conversion.

Cc: stable@vger.kernel.org
Fixes: 61cf93700f ("io_uring: Convert personality_idr to XArray")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-20 14:59:58 -06:00
Linus Torvalds
d6d09a6942 for-5.14-rc6-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmEdNDEACgkQxWXV+ddt
 WDvfDw//cDnR8HZEtrHwHX9qHitcYs6pubdwwGAsFlSZ/wh0iX05TxUjho4gGYMZ
 Kp9PXipMOEdxNLJ8oaPkI+i8vIXxTWWqAm5ZePkV0cjg+vTgqqKf9NLcMtS34kP4
 /GeQJgul9oreTMbXCx219J0B6lKpl6Iv0sCaSFyN09GIPNI8F6nyDbJTA3JKTRWb
 ElP8mGvdUFFcOKsG6Wh6BU/WVU/My7d+HumApsRXB2lDwmMambAkX0iGpRElGrbD
 ub+5ya0WeO8DB6KsVa4W8cMO5sWV9L9FcXMtGlwLbIkOxFdHvP7CT1pvH3TZe9Wy
 mr8oAL01IktuNjZgQ5sUn+yISf+LuHnWjhpu+QBRuylZiwfpMwSPb0geLcrXcYGj
 i8ERlmJvwbm6dAQlQDbA3yZKH6+FzePyTR99std2LK9JtbqBaFeSS6WM05SpRUDJ
 FNHCLOzsswzBUE54nkqsb+A8tBXpcxnvQkrU+nJeDNUYM9w6S5mbCGeZjxe/n+ov
 TGprz1ar2Ppm9YMH0zj6wOM690nJZYNrAvtUmeCl1xlLERYIoV2jXS1SzkaMdcQu
 u3UVVsOCghPN5krEac4jgiGBdVHvjVnJb/qGBNTpj3aDX29PADHUJr7TmzGJBa8F
 ePWqDngDYi+cVrm9JMls9UJhaCVmhzLAXtN5X3+fKfe5bNQE4gU=
 =aR16
 -----END PGP SIGNATURE-----

Merge tag 'for-5.14-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fix from David Sterba:
 "One more fix for cross-rename, adding a missing check for directory
  and subvolume, this could lead to a crash"

* tag 'for-5.14-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: prevent rename2 from exchanging a subvol with a directory from different parents
2021-08-18 12:06:42 -07:00
Linus Torvalds
3b844826b6 pipe: avoid unnecessary EPOLLET wakeups under normal loads
I had forgotten just how sensitive hackbench is to extra pipe wakeups,
and commit 3a34b13a88 ("pipe: make pipe writes always wake up
readers") ended up causing a quite noticeable regression on larger
machines.

Now, hackbench isn't necessarily a hugely meaningful benchmark, and it's
not clear that this matters in real life all that much, but as Mel
points out, it's used often enough when comparing kernels and so the
performance regression shows up like a sore thumb.

It's easy enough to fix at least for the common cases where pipes are
used purely for data transfer, and you never have any exciting poll
usage at all.  So set a special 'poll_usage' flag when there is polling
activity, and make the ugly "EPOLLET has crazy legacy expectations"
semantics explicit to only that case.

I would love to limit it to just the broken EPOLLET case, but the pipe
code can't see the difference between epoll and regular select/poll, so
any non-read/write waiting will trigger the extra wakeup behavior.  That
is sufficient for at least the hackbench case.

Apart from making the odd extra wakeup cases more explicitly about
EPOLLET, this also makes the extra wakeup be at the _end_ of the pipe
write, not at the first write chunk.  That is actually much saner
semantics (as much as you can call any of the legacy edge-triggered
expectations for EPOLLET "sane") since it means that you know the wakeup
will happen once the write is done, rather than possibly in the middle
of one.

[ For stable people: I'm putting a "Fixes" tag on this, but I leave it
  up to you to decide whether you actually want to backport it or not.
  It likely has no impact outside of synthetic benchmarks  - Linus ]

Link: https://lore.kernel.org/lkml/20210802024945.GA8372@xsang-OptiPlex-9020/
Fixes: 3a34b13a88 ("pipe: make pipe writes always wake up readers")
Reported-by: kernel test robot <oliver.sang@intel.com>
Tested-by: Sandeep Patil <sspatil@android.com>
Tested-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-08-18 11:39:46 -07:00
Pavel Begunkov
9cb0073b30 io_uring: pin ctx on fallback execution
Pin ring in io_fallback_req_func() by briefly elevating ctx->refs in
case any task_work handler touches ctx after releasing a request.

Fixes: 9011bf9a13 ("io_uring: fix stuck fallback reqs")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/833a494713d235ec144284a9bbfe418df4f6b61c.1629235576.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-17 16:06:14 -06:00
NeilBrown
3f79f6f624 btrfs: prevent rename2 from exchanging a subvol with a directory from different parents
Cross-rename lacks a check when that would prevent exchanging a
directory and subvolume from different parent subvolume. This causes
data inconsistencies and is caught before commit by tree-checker,
turning the filesystem to read-only.

Calling the renameat2 with RENAME_EXCHANGE flags like

  renameat2(AT_FDCWD, namesrc, AT_FDCWD, namedest, (1 << 1))

on two paths:

  namesrc = dir1/subvol1/dir2
 namedest = subvol2/subvol3

will cause key order problem with following write time tree-checker
report:

  [1194842.307890] BTRFS critical (device loop1): corrupt leaf: root=5 block=27574272 slot=10 ino=258, invalid previous key objectid, have 257 expect 258
  [1194842.322221] BTRFS info (device loop1): leaf 27574272 gen 8 total ptrs 11 free space 15444 owner 5
  [1194842.331562] BTRFS info (device loop1): refs 2 lock_owner 0 current 26561
  [1194842.338772]        item 0 key (256 1 0) itemoff 16123 itemsize 160
  [1194842.338793]                inode generation 3 size 16 mode 40755
  [1194842.338801]        item 1 key (256 12 256) itemoff 16111 itemsize 12
  [1194842.338809]        item 2 key (256 84 2248503653) itemoff 16077 itemsize 34
  [1194842.338817]                dir oid 258 type 2
  [1194842.338823]        item 3 key (256 84 2363071922) itemoff 16043 itemsize 34
  [1194842.338830]                dir oid 257 type 2
  [1194842.338836]        item 4 key (256 96 2) itemoff 16009 itemsize 34
  [1194842.338843]        item 5 key (256 96 3) itemoff 15975 itemsize 34
  [1194842.338852]        item 6 key (257 1 0) itemoff 15815 itemsize 160
  [1194842.338863]                inode generation 6 size 8 mode 40755
  [1194842.338869]        item 7 key (257 12 256) itemoff 15801 itemsize 14
  [1194842.338876]        item 8 key (257 84 2505409169) itemoff 15767 itemsize 34
  [1194842.338883]                dir oid 256 type 2
  [1194842.338888]        item 9 key (257 96 2) itemoff 15733 itemsize 34
  [1194842.338895]        item 10 key (258 12 256) itemoff 15719 itemsize 14
  [1194842.339163] BTRFS error (device loop1): block=27574272 write time tree block corruption detected
  [1194842.339245] ------------[ cut here ]------------
  [1194842.443422] WARNING: CPU: 6 PID: 26561 at fs/btrfs/disk-io.c:449 csum_one_extent_buffer+0xed/0x100 [btrfs]
  [1194842.511863] CPU: 6 PID: 26561 Comm: kworker/u17:2 Not tainted 5.14.0-rc3-git+ #793
  [1194842.511870] Hardware name: empty empty/S3993, BIOS PAQEX0-3 02/24/2008
  [1194842.511876] Workqueue: btrfs-worker-high btrfs_work_helper [btrfs]
  [1194842.511976] RIP: 0010:csum_one_extent_buffer+0xed/0x100 [btrfs]
  [1194842.512068] RSP: 0018:ffffa2c284d77da0 EFLAGS: 00010282
  [1194842.512074] RAX: 0000000000000000 RBX: 0000000000001000 RCX: ffff928867bd9978
  [1194842.512078] RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff928867bd9970
  [1194842.512081] RBP: ffff92876b958000 R08: 0000000000000001 R09: 00000000000c0003
  [1194842.512085] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
  [1194842.512088] R13: ffff92875f989f98 R14: 0000000000000000 R15: 0000000000000000
  [1194842.512092] FS:  0000000000000000(0000) GS:ffff928867a00000(0000) knlGS:0000000000000000
  [1194842.512095] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [1194842.512099] CR2: 000055f5384da1f0 CR3: 0000000102fe4000 CR4: 00000000000006e0
  [1194842.512103] Call Trace:
  [1194842.512128]  ? run_one_async_free+0x10/0x10 [btrfs]
  [1194842.631729]  btree_csum_one_bio+0x1ac/0x1d0 [btrfs]
  [1194842.631837]  run_one_async_start+0x18/0x30 [btrfs]
  [1194842.631938]  btrfs_work_helper+0xd5/0x1d0 [btrfs]
  [1194842.647482]  process_one_work+0x262/0x5e0
  [1194842.647520]  worker_thread+0x4c/0x320
  [1194842.655935]  ? process_one_work+0x5e0/0x5e0
  [1194842.655946]  kthread+0x135/0x160
  [1194842.655953]  ? set_kthread_struct+0x40/0x40
  [1194842.655965]  ret_from_fork+0x1f/0x30
  [1194842.672465] irq event stamp: 1729
  [1194842.672469] hardirqs last  enabled at (1735): [<ffffffffbd1104f5>] console_trylock_spinning+0x185/0x1a0
  [1194842.672477] hardirqs last disabled at (1740): [<ffffffffbd1104cc>] console_trylock_spinning+0x15c/0x1a0
  [1194842.672482] softirqs last  enabled at (1666): [<ffffffffbdc002e1>] __do_softirq+0x2e1/0x50a
  [1194842.672491] softirqs last disabled at (1651): [<ffffffffbd08aab7>] __irq_exit_rcu+0xa7/0xd0

The corrupted data will not be written, and filesystem can be unmounted
and mounted again (all changes since the last commit will be lost).

Add the missing check for new_ino so that all non-subvolumes must reside
under the same parent subvolume. There's an exception allowing to
exchange two subvolumes from any parents as the directory representing a
subvolume is only a logical link and does not have any other structures
related to the parent subvolume, unlike files, directories etc, that
are always in the inode namespace of the parent subvolume.

Fixes: cdd1fedf82 ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT")
CC: stable@vger.kernel.org # 4.7+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-16 13:33:23 +02:00
Linus Torvalds
7ba34c0cba libnvdimm fixes for v5.14-rc6
- Fix support for NFIT "virtual" ranges (BIOS-defined memory disks)
 
 - Fix recovery from failed label storage areas on NVDIMM devices
 
 - Miscellaneous cleanups from Ira's investigation of dax_direct_access
   paths preparing for stray-write protection.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSbo+XnGs+rwLz9XGXfioYZHlFsZwUCYRhC0wAKCRDfioYZHlFs
 Z6InAQD+duS9GS5DnnFInmRDj/rMRQFVB4X25mmSlViYOR0gNwEAtJQP03CGAp+G
 +DP7/nu2HrIhx8Ng8vTsu8ZnO8ge7Qw=
 =zmii
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-fixes-5.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm fixes from Dan Williams:
 "A couple of fixes for long standing bugs, a warning fixup, and some
  miscellaneous dax cleanups.

  The bugs were recently found due to new platforms looking to use the
  ACPI NFIT "virtual" device definition, and new error injection
  capabilities to trigger error responses to label area requests. Ira's
  cleanups have been long pending, I neglected to send them earlier, and
  see no harm in including them now. This has all appeared in -next with
  no reported issues.

  Summary:

   - Fix support for NFIT "virtual" ranges (BIOS-defined memory disks)

   - Fix recovery from failed label storage areas on NVDIMM devices

   - Miscellaneous cleanups from Ira's investigation of
     dax_direct_access paths preparing for stray-write protection"

* tag 'libnvdimm-fixes-5.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  tools/testing/nvdimm: Fix missing 'fallthrough' warning
  libnvdimm/region: Fix label activation vs errors
  ACPI: NFIT: Fix support for virtual SPA ranges
  dax: Ensure errno is returned from dax_direct_access
  fs/dax: Clarify nr_pages to dax_direct_access()
  fs/fuse: Remove unneeded kaddr parameter
2021-08-14 19:46:39 -10:00
Jens Axboe
21f965221e io_uring: only assign io_uring_enter() SQPOLL error in actual error case
If an SQPOLL based ring is newly created and an application issues an
io_uring_enter(2) system call on it, then we can return a spurious
-EOWNERDEAD error. This happens because there's nothing to submit, and
if the caller doesn't specify any other action, the initial error
assignment of -EOWNERDEAD never gets overwritten. This causes us to
return it directly, even if it isn't valid.

Move the error assignment into the actual failure case instead.

Cc: stable@vger.kernel.org
Fixes: d9d05217cb ("io_uring: stop SQPOLL submit on creator's death")
Reported-by: Sherlock Holo sherlockya@gmail.com
Link: https://github.com/axboe/liburing/issues/413
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-14 12:38:21 -06:00
Linus Torvalds
118516e212 configfs fix for Linux 5.14
- fix to revert to the historic write behavior (Bart Van Assche)
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmEXaXgLHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYNQLg/+Lagdz93ESb7VZHGKMQQkyMM4Zx8DBv3eMRaIAw19
 jK87v15tGrrcse/JLmBWo1s3d5HDZGKOYhsUsv2dAqsa3P7S5p7Hihz4WSGlEQAS
 UnqqHUafVTPwBqHgt1StF9BpE6QH2zovlJeHnSok6fPvJcUvC5h9Z83mgNW2SUf/
 zut1GnqVp82jaDDfJymLIFpT4hRjfj2CpsMa38YU/M0Bunhn87tUFKHVzpdnTG9G
 v0iLXuGfax1KWJCX3Sf4Pw9vCCTzIUHmWrbH/8X/AywYe5enhuHfTFQAxn623jAg
 TzFoU/ByR3Je4zhDmci20Kdgay3LREgjGO3iloZG2KcnRJZOSzYU+SX5IWQZvLon
 JWDqDzr8iR7DIdrfNjIbehYj9DRdlxn1iUr8mvCVK6uxN2deyiLHamD2kqv9fklW
 D6TOHHkwrCF8k+jQfAc9l5+vk98UsJwFyT9BYatA6U/jtffxlsf7OuN0LHRtzu7a
 4zdy5U/7tqT7W4PHy4/ICZN2ka2mm1c5I7JyjEgdj0Qongml4m7g/3vxSEKPJCeB
 Rj2SCA8163RqYTywEUO5lcjpTbwZBG4pPx6PMGIrhCGGnqdl+RcNVy3Kt2LEdbiq
 WXq7hQGoOsZLRkloej1B2D9x9mqyYPLzT+w/xzd5iJKVrLv06LHyi/d0GCKTHUNp
 XN8=
 =dsC9
 -----END PGP SIGNATURE-----

Merge tag 'configfs-5.14' of git://git.infradead.org/users/hch/configfs

Pull configfs fix from Christoph Hellwig:

 - fix to revert to the historic write behavior (Bart Van Assche)

* tag 'configfs-5.14' of git://git.infradead.org/users/hch/configfs:
  configfs: restore the kernel v5.13 text attribute write behavior
2021-08-14 06:22:42 -10:00
Linus Torvalds
27b2eaa118 4 CIFS/SMB3 Fixes, all for stable, 2 relating to deferred close, 1 for modefromsid mount option
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmEW5MQACgkQiiy9cAdy
 T1H6zQwAwOyQrMGA3VtsatmhhQo89oHbkUB4q4+tY7xabPNgTRAPT9BpRSFsrW4y
 njeiaP1T7nmiU1yJsYQD/JwRv005pBO/xa7sMsLb0RSca//kzin4WTTzxom3RUBW
 hU29PvAxl+AjaRvbo6VpSn6xuHH1BDcZU+YfWtX3c6tE30sdzwjyMu7rEhivNGCf
 0ukIZuIaEJrZmCwMZHT8+qE1dBOKEad8I39POG1v+mybQCWJvo4MAUDyYHZYKTJz
 6e3JDARI19G8hfq1oVAM/g5gIBRBEISug3jenOMVG/QnBnRsBvyuTIunoq4ba+S6
 qp3jv3p24DaUe9FPp5sYyAhuJHo0rzSwGBv/SdikJA+3xb8k2E3rECAUbf4j/NmV
 /sj0tN/6/Z05/6L4ZgqUjdS2KfLztDvgzTGnv/LsB097Nhb8hVIhsKoqzypmyAcH
 5AsjXFETMCclWE7oE8DkzaAOJqKD7MGJ6KXCjbt8JcFb/b6L/QQbRyRB5ggupgVn
 Ic7gn/iM
 =RlpY
 -----END PGP SIGNATURE-----

Merge tag '5.14-rc5-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
 "Four CIFS/SMB3 Fixes, all for stable, two relating to deferred close,
  and one for the 'modefromsid' mount option (when 'idsfromsid' not
  specified)"

* tag '5.14-rc5-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: Call close synchronously during unlink/rename/lease break.
  cifs: Handle race conditions during rename
  cifs: use the correct max-length for dentry_path_raw()
  cifs: create sd context must be a multiple of 8
2021-08-13 14:44:32 -10:00
Linus Torvalds
42995cee61 io_uring-5.14-2021-08-13
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmEWwP4QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpgqrEAC93z9w25k1mqVZLjMJgp1IKT3KsB1xfn+i
 QwLz1OVpveyf5fKzZEK6xPn6DhX22FmGca0FhPNYANlNJNgp8X1cT0REf+hdDbYJ
 IqF14Ue9DO5mbBiutb9aEYoTSZjg1HaZFPRfu71QkFZbTuVx8CoMwDsl1OXG7bcO
 5G67khXZFkChmaJVgaANfR4EKKqHXrVt+EOoOyBYdSZLWloWu2KuODXt/FtmllHc
 kf7bPTw0Za5f6oSJXX52B+RlMyK3HrA9FZZor8alyUd0GVsgu5vwY2vFNSyft8t2
 RazK/oqU+hBmV4wNVrAYAByYiq+j3lVLvE3AXPI81cXLyPyCVG0GqqRcycuUaMjV
 WikJyo6dGQh70yOXw3BhQojPsEL1OPcyMxIi+3g/TKL9kIYOBaO7Es+KwoEKD4fX
 fAkucTsb/Tc3WsYI3ELOedy9WuIMzAgysZxcTUxphKmRtwFNXLOYmQJEW5dK6MxZ
 sZfve0JCQsFdebHOjIIanPRJSwksTt0Xfdm1g+R8sDBfnLV81PbGY+8lrRgUiBJ9
 DYDd0hRPFnBuWT8h1qToUqKcUHL73IefXstMnstdtWwrL1FZs4JsPGYdZMxqfXSk
 Z/50aHSmB9KaouFaIrxM4rndrAgP2BJwXOu2f1YbiPTcKNpmdYTX5jT7mfaLSWX5
 ZpF6fkDiTA==
 =lx8b
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.14-2021-08-13' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:
 "A bit bigger than the previous weeks, but mostly just a few stable
  bound fixes. In detail:

   - Followup fixes to patches from last week for io-wq, turns out they
     weren't complete (Hao)

   - Two lockdep reported fixes out of the RT camp (me)

   - Sync the io_uring-cp example with liburing, as a few bug fixes
     never made it to the kernel carried version (me)

   - SQPOLL related TIF_NOTIFY_SIGNAL fix (Nadav)

   - Use WRITE_ONCE() when writing sq flags (Nadav)

   - io_rsrc_put_work() deadlock fix (Pavel)"

* tag 'io_uring-5.14-2021-08-13' of git://git.kernel.dk/linux-block:
  tools/io_uring/io_uring-cp: sync with liburing example
  io_uring: fix ctx-exit io_rsrc_put_work() deadlock
  io_uring: drop ctx->uring_lock before flushing work item
  io-wq: fix IO_WORKER_F_FIXED issue in create_io_worker()
  io-wq: fix bug of creating io-wokers unconditionally
  io_uring: rsrc ref lock needs to be IRQ safe
  io_uring: Use WRITE_ONCE() when writing to sq_flags
  io_uring: clear TIF_NOTIFY_SIGNAL when running task work
2021-08-13 13:25:08 -10:00
Linus Torvalds
3a03c67de2 A patch to avoid a soft lockup in ceph_check_delayed_caps() from Luis
and a reference handling fix from Jeff that should address some memory
 corruption reports in the snaprealm area.  Both marked for stable.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmEVaqsTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi/DBCACd7+mnAXIwajwoDdXFIJT7/tfimdvU
 cMrh6ciZNtEKxm23flQ1AFJXlXR/nlZRspfOmlmsl9bB4TAlXnhJ/s4JaiuOMMTh
 OQ4oz0vAbGELkPsXB/FXGSSk1wTFEjCocFsJwoYiUkYjD7Qt12BZKNkFYgj/MVc2
 wyJ5K1buqBLVFDU+CymqDzc07YpG1zn888o7UGWFTyevldRAHl2euxqbnr0S4qb9
 OS5UKO3aFCEt5PT9RKRHygCGjuHym/fgXgPm9aNY4rYBE9qOXloVUOD5bhMHBJ2E
 g506xhOurqbGv4O9oj+gvBwtQwY/TF8BvCA79koQSHNIYQsC/bcXenST
 =m8x8
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-5.14-rc6' of git://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
 "A patch to avoid a soft lockup in ceph_check_delayed_caps() from Luis
  and a reference handling fix from Jeff that should address some memory
  corruption reports in the snaprealm area.

  Both marked for stable"

* tag 'ceph-for-5.14-rc6' of git://github.com/ceph/ceph-client:
  ceph: take snap_empty_lock atomically with snaprealm refcount change
  ceph: reduce contention in ceph_check_delayed_caps()
2021-08-12 16:16:01 -10:00
Linus Torvalds
f8fbb47c6e Merge branch 'for-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull ucounts fix from Eric Biederman:
 "This fixes the ucount sysctls on big endian architectures.

  The counts were expanded to be longs instead of ints, and the sysctl
  code was overlooked, so only the low 32bit were being processed. On
  litte endian just processing the low 32bits is fine, but on 64bit big
  endian processing just the low 32bits results in the high order bits
  instead of the low order bits being processed and nothing works
  proper.

  This change took a little bit to mature as we have the SYSCTL_ZERO,
  and SYSCTL_INT_MAX macros that are only usable for sysctls operating
  on ints, but unfortunately are not obviously broken. Which resulted in
  the versions of this change working on big endian and not on little
  endian, because the int SYSCTL_ZERO when extended 64bit wound up being
  0x100000000. So we only allowed values greater than 0x100000000 and
  less than 0faff. Which unfortunately broken everything that tried to
  set the sysctls. (First reported with the windows subsystem for
  linux).

  I have tested this on x86_64 64bit after first reproducing the
  problems with the earlier version of this change, and then verifying
  the problems do not exist when we use appropriate long min and max
  values for extra1 and extra2"

* 'for-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  ucounts: add missing data type changes
2021-08-12 07:20:16 -10:00
Rohith Surabattula
9e992755be cifs: Call close synchronously during unlink/rename/lease break.
During unlink/rename/lease break, deferred work for close is
scheduled immediately but in an asynchronous manner which might
lead to race with actual(unlink/rename) commands.

This change will schedule close synchronously which will avoid
the race conditions with other commands.

Signed-off-by: Rohith Surabattula <rohiths@microsoft.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Cc: stable@vger.kernel.org # 5.13
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-08-12 11:29:58 -05:00
Rohith Surabattula
41535701da cifs: Handle race conditions during rename
When rename is executed on directory which has files for which
close is deferred, then rename will fail with EACCES.

This patch will try to close all deferred files when EACCES is received
and retry rename on a directory.

Signed-off-by: Rohith Surabattula <rohiths@microsoft.com>
Cc: stable@vger.kernel.org # 5.13
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-08-12 11:29:54 -05:00
Dan Williams
96dcb97d0a Merge branch 'for-5.14/dax' into libnvdimm-fixes
Pick up some small dax cleanups that make some of Ira's follow on work
easier.
2021-08-11 12:04:43 -07:00
Linus Torvalds
b3f0ccc59c overlayfs fixes for 5.14-rc6
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCYRKSDgAKCRDh3BK/laaZ
 PKSNAQCd1yGLShL44sI5lCFnGjwHGCXdfU5b8sIxNBy5DOWvTwD/edF4eUJzyME+
 mZ4AwnX70N2eHJCFH/uodL0Y9Sf3egM=
 =zUIV
 -----END PGP SIGNATURE-----

Merge tag 'ovl-fixes-5.14-rc6-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs

Pull overlayfs fixes from Miklos Szeredi:
 "Fix several bugs in overlayfs"

* tag 'ovl-fixes-5.14-rc6-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: prevent private clone if bind mount is not allowed
  ovl: fix uninitialized pointer read in ovl_lookup_real_one()
  ovl: fix deadlock in splice write
  ovl: skip stale entries in merge dir cache iteration
2021-08-10 09:40:09 -07:00
Ronnie Sahlberg
981567bd96 cifs: use the correct max-length for dentry_path_raw()
RHBZ: 1972502

PATH_MAX is 4096 but PAGE_SIZE can be >4096 on some architectures
such as ppc and would thus write beyond the end of the actual object.

Cc: <stable@vger.kernel.org>
Reported-by: Xiaoli Feng <xifeng@redhat.com>
Suggested-by: Brian foster <bfoster@redhat.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-08-10 10:45:50 -05:00
Miklos Szeredi
427215d85e ovl: prevent private clone if bind mount is not allowed
Add the following checks from __do_loopback() to clone_private_mount() as
well:

 - verify that the mount is in the current namespace

 - verify that there are no locked children

Reported-by: Alois Wohlschlager <alois1@gmx-topmail.de>
Fixes: c771d683a6 ("vfs: introduce clone_private_mount()")
Cc: <stable@vger.kernel.org> # v3.18
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-08-10 10:21:31 +02:00
Miklos Szeredi
580c610429 ovl: fix uninitialized pointer read in ovl_lookup_real_one()
One error path can result in release_dentry_name_snapshot() being called
before "name" was initialized by take_dentry_name_snapshot().

Fix by moving the release_dentry_name_snapshot() to immediately after the
only use.

Reported-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-08-10 10:21:30 +02:00
Miklos Szeredi
9b91b6b019 ovl: fix deadlock in splice write
There's possibility of an ABBA deadlock in case of a splice write to an
overlayfs file and a concurrent splice write to a corresponding real file.

The call chain for splice to an overlay file:

 -> do_splice                     [takes sb_writers on overlay file]
   -> do_splice_from
     -> iter_file_splice_write    [takes pipe->mutex]
       -> vfs_iter_write
         ...
         -> ovl_write_iter        [takes sb_writers on real file]

And the call chain for splice to a real file:

 -> do_splice                     [takes sb_writers on real file]
   -> do_splice_from
     -> iter_file_splice_write    [takes pipe->mutex]

Syzbot successfully bisected this to commit 82a763e61e ("ovl: simplify
file splice").

Fix by reverting the write part of the above commit and by adding missing
bits from ovl_write_iter() into ovl_splice_write().

Fixes: 82a763e61e ("ovl: simplify file splice")
Reported-and-tested-by: syzbot+579885d1a9a833336209@syzkaller.appspotmail.com
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-08-10 10:21:30 +02:00
Amir Goldstein
9011c2791e ovl: skip stale entries in merge dir cache iteration
On the first getdents call, ovl_iterate() populates the readdir cache
with a list of entries, but for upper entries with origin lower inode,
p->ino remains zero.

Following getdents calls traverse the readdir cache list and call
ovl_cache_update_ino() for entries with zero p->ino to lookup the entry
in the overlay and return d_ino that is consistent with st_ino.

If the upper file was unlinked between the first getdents call and the
getdents call that lists the file entry, ovl_cache_update_ino() will not
find the entry and fall back to setting d_ino to the upper real st_ino,
which is inconsistent with how this object was presented to users.

Instead of listing a stale entry with inconsistent d_ino, simply skip
the stale entry, which is better for users.

xfstest overlay/077 is failing without this patch.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/fstests/CAOQ4uxgR_cLnC_vdU5=seP3fwqVkuZM_-WfD6maFTMbMYq=a9w@mail.gmail.com/
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-08-10 10:21:30 +02:00
Pavel Begunkov
43597aac1f io_uring: fix ctx-exit io_rsrc_put_work() deadlock
__io_rsrc_put_work() might need ->uring_lock, so nobody should wait for
rsrc nodes holding the mutex. However, that's exactly what
io_ring_ctx_free() does with io_wait_rsrc_data().

Split it into rsrc wait + dealloc, and move the first one out of the
lock.

Cc: stable@vger.kernel.org
Fixes: b60c8dce33 ("io_uring: preparation for rsrc tagging")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0130c5c2693468173ec1afab714e0885d2c9c363.1628559783.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-09 19:59:28 -06:00
Jens Axboe
c018db4a57 io_uring: drop ctx->uring_lock before flushing work item
Ammar reports that he's seeing a lockdep splat on running test/rsrc_tags
from the regression suite:

======================================================
WARNING: possible circular locking dependency detected
5.14.0-rc3-bluetea-test-00249-gc7d102232649 #5 Tainted: G           OE
------------------------------------------------------
kworker/2:4/2684 is trying to acquire lock:
ffff88814bb1c0a8 (&ctx->uring_lock){+.+.}-{3:3}, at: io_rsrc_put_work+0x13d/0x1a0

but task is already holding lock:
ffffc90001c6be70 ((work_completion)(&(&ctx->rsrc_put_work)->work)){+.+.}-{0:0}, at: process_one_work+0x1bc/0x530

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 ((work_completion)(&(&ctx->rsrc_put_work)->work)){+.+.}-{0:0}:
       __flush_work+0x31b/0x490
       io_rsrc_ref_quiesce.part.0.constprop.0+0x35/0xb0
       __do_sys_io_uring_register+0x45b/0x1060
       do_syscall_64+0x35/0xb0
       entry_SYSCALL_64_after_hwframe+0x44/0xae

-> #0 (&ctx->uring_lock){+.+.}-{3:3}:
       __lock_acquire+0x119a/0x1e10
       lock_acquire+0xc8/0x2f0
       __mutex_lock+0x86/0x740
       io_rsrc_put_work+0x13d/0x1a0
       process_one_work+0x236/0x530
       worker_thread+0x52/0x3b0
       kthread+0x135/0x160
       ret_from_fork+0x1f/0x30

other info that might help us debug this:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock((work_completion)(&(&ctx->rsrc_put_work)->work));
                               lock(&ctx->uring_lock);
                               lock((work_completion)(&(&ctx->rsrc_put_work)->work));
  lock(&ctx->uring_lock);

 *** DEADLOCK ***

2 locks held by kworker/2:4/2684:
 #0: ffff88810004d938 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x1bc/0x530
 #1: ffffc90001c6be70 ((work_completion)(&(&ctx->rsrc_put_work)->work)){+.+.}-{0:0}, at: process_one_work+0x1bc/0x530

stack backtrace:
CPU: 2 PID: 2684 Comm: kworker/2:4 Tainted: G           OE     5.14.0-rc3-bluetea-test-00249-gc7d102232649 #5
Hardware name: Acer Aspire ES1-421/OLVIA_BE, BIOS V1.05 07/02/2015
Workqueue: events io_rsrc_put_work
Call Trace:
 dump_stack_lvl+0x6a/0x9a
 check_noncircular+0xfe/0x110
 __lock_acquire+0x119a/0x1e10
 lock_acquire+0xc8/0x2f0
 ? io_rsrc_put_work+0x13d/0x1a0
 __mutex_lock+0x86/0x740
 ? io_rsrc_put_work+0x13d/0x1a0
 ? io_rsrc_put_work+0x13d/0x1a0
 ? io_rsrc_put_work+0x13d/0x1a0
 ? process_one_work+0x1ce/0x530
 io_rsrc_put_work+0x13d/0x1a0
 process_one_work+0x236/0x530
 worker_thread+0x52/0x3b0
 ? process_one_work+0x530/0x530
 kthread+0x135/0x160
 ? set_kthread_struct+0x40/0x40
 ret_from_fork+0x1f/0x30

which is due to holding the ctx->uring_lock when flushing existing
pending work, while the pending work flushing may need to grab the uring
lock if we're using IOPOLL.

Fix this by dropping the uring_lock a bit earlier as part of the flush.

Cc: stable@vger.kernel.org
Link: https://github.com/axboe/liburing/issues/404
Tested-by: Ammar Faizi <ammarfaizi2@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-09 19:59:06 -06:00
Hao Xu
47cae0c71f io-wq: fix IO_WORKER_F_FIXED issue in create_io_worker()
There may be cases like:
        A                                 B
spin_lock(wqe->lock)
nr_workers is 0
nr_workers++
spin_unlock(wqe->lock)
                                     spin_lock(wqe->lock)
                                     nr_wokers is 1
                                     nr_workers++
                                     spin_unlock(wqe->lock)
create_io_worker()
  acct->worker is 1
                                     create_io_worker()
                                       acct->worker is 1

There should be one worker marked IO_WORKER_F_FIXED, but no one is.
Fix this by introduce a new agrument for create_io_worker() to indicate
if it is the first worker.

Fixes: 3d4e4face9 ("io-wq: fix no lock protection of acct->nr_worker")
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20210808135434.68667-3-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-09 19:59:06 -06:00
Hao Xu
49e7f0c789 io-wq: fix bug of creating io-wokers unconditionally
The former patch to add check between nr_workers and max_workers has a
bug, which will cause unconditionally creating io-workers. That's
because the result of the check doesn't affect the call of
create_io_worker(), fix it by bringing in a boolean value for it.

Fixes: 21698274da ("io-wq: fix lack of acct->nr_workers < acct->max_workers judgement")
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20210808135434.68667-2-haoxu@linux.alibaba.com
[axboe: drop hunk that isn't strictly needed]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-09 19:59:06 -06:00
Jens Axboe
4956b9eaad io_uring: rsrc ref lock needs to be IRQ safe
Nadav reports running into the below splat on re-enabling softirqs:

WARNING: CPU: 2 PID: 1777 at kernel/softirq.c:364 __local_bh_enable_ip+0xaa/0xe0
Modules linked in:
CPU: 2 PID: 1777 Comm: umem Not tainted 5.13.1+ #161
Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/22/2020
RIP: 0010:__local_bh_enable_ip+0xaa/0xe0
Code: a9 00 ff ff 00 74 38 65 ff 0d a2 21 8c 7a e8 ed 1a 20 00 fb 66 0f 1f 44 00 00 5b 41 5c 5d c3 65 8b 05 e6 2d 8c 7a 85 c0 75 9a <0f> 0b eb 96 e8 2d 1f 20 00 eb a5 4c 89 e7 e8 73 4f 0c 00 eb ae 65
RSP: 0018:ffff88812e58fcc8 EFLAGS: 00010046
RAX: 0000000000000000 RBX: 0000000000000201 RCX: dffffc0000000000
RDX: 0000000000000007 RSI: 0000000000000201 RDI: ffffffff8898c5ac
RBP: ffff88812e58fcd8 R08: ffffffff8575dbbf R09: ffffed1028ef14f9
R10: ffff88814778a7c3 R11: ffffed1028ef14f8 R12: ffffffff85c9e9ae
R13: ffff88814778a000 R14: ffff88814778a7b0 R15: ffff8881086db890
FS:  00007fbcfee17700(0000) GS:ffff8881e0300000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000c0402a5008 CR3: 000000011c1ac003 CR4: 00000000003706e0
Call Trace:
 _raw_spin_unlock_bh+0x31/0x40
 io_rsrc_node_ref_zero+0x13e/0x190
 io_dismantle_req+0x215/0x220
 io_req_complete_post+0x1b8/0x720
 __io_complete_rw.isra.0+0x16b/0x1f0
 io_complete_rw+0x10/0x20

where it's clear we end up calling the percpu count release directly
from the completion path, as it's in atomic mode and we drop the last
ref. For file/block IO, this can be from IRQ context already, and the
softirq locking for rsrc isn't enough.

Just make the lock fully IRQ safe, and ensure we correctly safe state
from the release path as we don't know the full context there.

Reported-by: Nadav Amit <nadav.amit@gmail.com>
Tested-by: Nadav Amit <nadav.amit@gmail.com>
Link: https://lore.kernel.org/io-uring/C187C836-E78B-4A31-B24C-D16919ACA093@gmail.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-09 19:58:59 -06:00
Sven Schnelle
f153c22467 ucounts: add missing data type changes
commit f9c82a4ea8 ("Increase size of ucounts to atomic_long_t")
changed the data type of ucounts/ucounts_max to long, but missed to
adjust a few other places. This is noticeable on big endian platforms
from user space because the /proc/sys/user/max_*_names files all
contain 0.

v4 - Made the min and max constants long so the sysctl values
     are actually settable on little endian machines.
     -- EWB

Fixes: f9c82a4ea8 ("Increase size of ucounts to atomic_long_t")
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Linux Kernel Functional Testing <lkft@linaro.org>
Acked-by: Alexey Gladkov <legion@kernel.org>
v1: https://lkml.kernel.org/r/20210721115800.910778-1-svens@linux.ibm.com
v2: https://lkml.kernel.org/r/20210721125233.1041429-1-svens@linux.ibm.com
v3: https://lkml.kernel.org/r/20210730062854.3601635-1-svens@linux.ibm.com
Link: https://lkml.kernel.org/r/8735rijqlv.fsf_-_@disp2133
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2021-08-09 15:45:02 -05:00
Bart Van Assche
769f526767 configfs: restore the kernel v5.13 text attribute write behavior
Instead of appending new text attribute data at the offset specified by the
write() system call, only pass the newly written data to the .store()
callback.

Reported-by: Bodo Stroesser <bostroesser@gmail.com>
Tested-by: Bodo Stroesser <bostroesser@gmail.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-08-09 16:56:00 +02:00
Nadav Amit
20c0b380f9 io_uring: Use WRITE_ONCE() when writing to sq_flags
The compiler should be forbidden from any strange optimization for async
writes to user visible data-structures. Without proper protection, the
compiler can cause write-tearing or invent writes that would confuse the
userspace.

However, there are writes to sq_flags which are not protected by
WRITE_ONCE(). Use WRITE_ONCE() for these writes.

This is purely a theoretical issue. Presumably, any compiler is very
unlikely to do such optimizations.

Fixes: 75b28affdd ("io_uring: allocate the two rings together")
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
Link: https://lore.kernel.org/r/20210808001342.964634-3-namit@vmware.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-08 21:21:11 -06:00
Nadav Amit
ef98eb0409 io_uring: clear TIF_NOTIFY_SIGNAL when running task work
When using SQPOLL, the submission queue polling thread calls
task_work_run() to run queued work. However, when work is added with
TWA_SIGNAL - as done by io_uring itself - the TIF_NOTIFY_SIGNAL remains
set afterwards and is never cleared.

Consequently, when the submission queue polling thread checks whether
signal_pending(), it may always find a pending signal, if
task_work_add() was ever called before.

The impact of this bug might be different on different kernel versions.
It appears that on 5.14 it would only cause unnecessary calculation and
prevent the polling thread from sleeping. On 5.13, where the bug was
found, it stops the polling thread from finding newly submitted work.

Instead of task_work_run(), use tracehook_notify_signal() that clears
TIF_NOTIFY_SIGNAL. Test for TIF_NOTIFY_SIGNAL in addition to
current->task_works to avoid a race in which task_works is cleared but
the TIF_NOTIFY_SIGNAL is set.

Fixes: 685fe7feed ("io-wq: eliminate the need for a manager thread")
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
Link: https://lore.kernel.org/r/20210808001342.964634-2-namit@vmware.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-08 21:21:11 -06:00
Linus Torvalds
85a90500f9 io_uring-5.14-2021-08-07
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmEOuKcQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpk4WD/9XQIUM2WmJrWCdrGJ64hgWyOFMg7Zhzbwn
 1swI2yMRcAHVOphP3Iyavxn7JHk+AKRcRgnjsKBdIn1WykkRwoIGfYGhTaO2zD7E
 T6r8BH/qLkwJcbhA9KNmhA3gf/zEnJgvc6JOVMtSvAdO6FTSfrh1aQneb/xjr0hx
 Yuy4dUtx6AmXjUZ0lh4XCud8IOzOrvjF9cZHVq/Q+DP+m/MnlpZOVws9Ge0XeAgN
 t7I9NouiyHNt5Z1SIQITbPvAGjY6/z/j2FOuECxgfdo7VpNUgK+V3byaXLFwLjLf
 X7DI9G1pf0fNPOrJktS7WarLVX4NKw8nt1Og5iOhFp7MoBl6SMWWY/2YOESk172g
 k0KGnPeoF+NZGnCyWLs6cEk6X0KQODAwWLhiYUuDcP3C/3H9BNNIR3IkfRhhOoJZ
 1aTvMjYRa0uD/fRnHydEfMlLl9PyjnmVFD8WLwI70zUO0GpuhlfLIq0rWgtsOG35
 s/CJVDH4VoDnMsx5Pg+oXsywiKDhhVLP2y4D35B2ivFrrLy+pAfXI8t7+IVP5Xm4
 sMmVqEtZLBvYBxrbsO81cWI5Ddcc4/z+F/I5DfACdi26GZC/Pn1/L3Le+i4zLypz
 CEY8vSM9+qgApxElaWrNOPtOFWV/XUll3lRLNIT9QncqtdmDdkofBdSv9lNj+wRt
 pVhBn/B6NQ==
 =FAeG
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.14-2021-08-07' of git://git.kernel.dk/linux-block

Pull io_uring from Jens Axboe:
 "A few io-wq related fixes:

   - Fix potential nr_worker race and missing max_workers check from one
     path (Hao)

   - Fix race between worker exiting and new work queue (me)"

* tag 'io_uring-5.14-2021-08-07' of git://git.kernel.dk/linux-block:
  io-wq: fix lack of acct->nr_workers < acct->max_workers judgement
  io-wq: fix no lock protection of acct->nr_worker
  io-wq: fix race between worker exiting and activating free worker
2021-08-07 10:34:26 -07:00
Linus Torvalds
c9194f32bf A Regression fix, bug fix, and a comment cleanup for ext4.
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmENbm0ACgkQ8vlZVpUN
 gaNEEQf7B8GXPqJvRgqhJwCdqmZqz8DB7dfjqT0SB99f1EL3VoeHEvo+yEgMqD3L
 cSYRFh4efEHgr51HSZoIPINqcU9hs86SvFmjd6jWIcnY/EJLd0g3e8aEWpJ3S5rR
 3avSC4tiDbn34GgDeoR2DFG6RsGbRxDUEEzkrd8h7Hx6q39s3aXdi89lBmBe8rg/
 lIVaeivZrZ7SfY/YFEziF0P7KurJNju6lGwqm0xAqu79J9QaabXMF1u5GPjUi2rw
 TIaLMSP6O5VQbQwskcTIhJlKSAB4aUIB+fMV5Zi2cCXAKGdzK24xdM5VbzOeKAAX
 1EwOE9GEyytpxD1P0zb8vVGsJW3wjQ==
 =hQAS
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "A regression fix, bug fix, and a comment cleanup for ext4"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix potential htree corruption when growing large_dir directories
  ext4: remove conflicting comment from __ext4_forget
  ext4: fix potential uninitialized access to retval in kmmpd
2021-08-06 12:55:07 -07:00
Theodore Ts'o
877ba3f729 ext4: fix potential htree corruption when growing large_dir directories
Commit b5776e7524 ("ext4: fix potential htree index checksum
corruption) removed a required restart when multiple levels of index
nodes need to be split.  Fix this to avoid directory htree corruptions
when using the large_dir feature.

Cc: stable@kernel.org # v5.11
Cc: Благодаренко Артём <artem.blagodarenko@gmail.com>
Fixes: b5776e7524 ("ext4: fix potential htree index checksum corruption)
Reported-by: Denis <denis@voxelsoft.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-08-06 13:00:49 -04:00
Hao Xu
21698274da io-wq: fix lack of acct->nr_workers < acct->max_workers judgement
There should be this judgement before we create an io-worker

Fixes: 685fe7feed ("io-wq: eliminate the need for a manager thread")
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-06 08:28:18 -06:00
Hao Xu
3d4e4face9 io-wq: fix no lock protection of acct->nr_worker
There is an acct->nr_worker visit without lock protection. Think about
the case: two callers call io_wqe_wake_worker(), one is the original
context and the other one is an io-worker(by calling
io_wqe_enqueue(wqe, linked)), on two cpus paralelly, this may cause
nr_worker to be larger than max_worker.
Let's fix it by adding lock for it, and let's do nr_workers++ before
create_io_worker. There may be a edge cause that the first caller fails
to create an io-worker, but the second caller doesn't know it and then
quit creating io-worker as well:

say nr_worker = max_worker - 1
        cpu 0                        cpu 1
   io_wqe_wake_worker()          io_wqe_wake_worker()
      nr_worker < max_worker
      nr_worker++
      create_io_worker()         nr_worker == max_worker
         failed                  return
      return

But the chance of this case is very slim.

Fixes: 685fe7feed ("io-wq: eliminate the need for a manager thread")
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
[axboe: fix unconditional create_io_worker() call]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-06 08:27:54 -06:00
Shyam Prasad N
7d3fc01796 cifs: create sd context must be a multiple of 8
We used to follow the rule earlier that the create SD context
always be a multiple of 8. However, with the change:
cifs: refactor create_sd_buf() and and avoid corrupting the buffer
...we recompute the length, and we failed that rule.
Fixing that with this change.

Cc: <stable@vger.kernel.org> # v5.10+
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-08-05 12:48:42 -05:00
Alex Xu (Hello71)
46c4c9d1be pipe: increase minimum default pipe size to 2 pages
This program always prints 4096 and hangs before the patch, and always
prints 8192 and exits successfully after:

  int main()
  {
      int pipefd[2];
      for (int i = 0; i < 1025; i++)
          if (pipe(pipefd) == -1)
              return 1;
      size_t bufsz = fcntl(pipefd[1], F_GETPIPE_SZ);
      printf("%zd\n", bufsz);
      char *buf = calloc(bufsz, 1);
      write(pipefd[1], buf, bufsz);
      read(pipefd[0], buf, bufsz-1);
      write(pipefd[1], buf, 1);
  }

Note that you may need to increase your RLIMIT_NOFILE before running the
program.

Fixes: 759c01142a ("pipe: limit the per-user amount of pages allocated in pipes")
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/lkml/1628086770.5rn8p04n6j.none@localhost/
Link: https://lore.kernel.org/lkml/1628127094.lxxn016tj7.none@localhost/
Signed-off-by: Alex Xu (Hello71) <alex_y_xu@yahoo.ca>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-08-05 10:30:47 -07:00
Jens Axboe
83d6c39310 io-wq: fix race between worker exiting and activating free worker
Nadav correctly reports that we have a race between a worker exiting,
and new work being queued. This can lead to work being queued behind
an existing worker that could be sleeping on an event before it can
run to completion, and hence introducing potential big latency gaps
if we hit this race condition:

cpu0					cpu1
----					----
					io_wqe_worker()
					schedule_timeout()
					 // timed out
io_wqe_enqueue()
io_wqe_wake_worker()
// work_flags & IO_WQ_WORK_CONCURRENT
io_wqe_activate_free_worker()
					 io_worker_exit()

Fix this by having the exiting worker go through the normal decrement
of a running worker, which will spawn a new one if needed.

The free worker activation is modified to only return success if we
were able to find a sleeping worker - if not, we keep looking through
the list. If we fail, we create a new worker as per usual.

Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/io-uring/BFF746C0-FEDE-4646-A253-3021C57C26C9@gmail.com/
Reported-by: Nadav Amit <nadav.amit@gmail.com>
Tested-by: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-04 14:34:46 -06:00
Jeff Layton
8434ffe71c ceph: take snap_empty_lock atomically with snaprealm refcount change
There is a race in ceph_put_snap_realm. The change to the nref and the
spinlock acquisition are not done atomically, so you could decrement
nref, and before you take the spinlock, the nref is incremented again.
At that point, you end up putting it on the empty list when it
shouldn't be there. Eventually __cleanup_empty_realms runs and frees
it when it's still in-use.

Fix this by protecting the 1->0 transition with atomic_dec_and_lock,
and just drop the spinlock if we can get the rwsem.

Because these objects can also undergo a 0->1 refcount transition, we
must protect that change as well with the spinlock. Increment locklessly
unless the value is at 0, in which case we take the spinlock, increment
and then take it off the empty list if it did the 0->1 transition.

With these changes, I'm removing the dout() messages from these
functions, as well as in __put_snap_realm. They've always been racy, and
it's better to not print values that may be misleading.

Cc: stable@vger.kernel.org
URL: https://tracker.ceph.com/issues/46419
Reported-by: Mark Nelson <mnelson@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Luis Henriques <lhenriques@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-08-04 19:20:29 +02:00
Luis Henriques
bf2ba43221 ceph: reduce contention in ceph_check_delayed_caps()
Function ceph_check_delayed_caps() is called from the mdsc->delayed_work
workqueue and it can be kept looping for quite some time if caps keep
being added back to the mdsc->cap_delay_list.  This may result in the
watchdog tainting the kernel with the softlockup flag.

This patch breaks this loop if the caps have been recently (i.e. during
the loop execution).  Any new caps added to the list will be handled in
the next run.

Also, allow schedule_delayed() callers to explicitly set the delay value
instead of defaulting to 5s, so we can ensure that it runs soon
afterward if it looks like there is more work.

Cc: stable@vger.kernel.org
URL: https://tracker.ceph.com/issues/46284
Signed-off-by: Luis Henriques <lhenriques@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-08-04 19:20:05 +02:00
Linus Torvalds
aa6603266c Fixes for 5.14-rc4:
* Fix a number of coordination bugs relating to cache flushes for
    metadata writeback, cache flushes for multi-buffer log writes, and
    FUA writes for single-buffer log writes.
  * Fix a bug with incorrect replay of attr3 blocks.
  * Fix unnecessary stalls when flushing logs to disk.
  * Fix spoofing problems when recovering realtime bitmap blocks.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmEC2PgACgkQ+H93GTRK
 tOtEvg//XQTcqKgO+60lJzhfgfGD8HsYWGcAc0UW8vu0I6gPNstd/PHKBCYkhT66
 rp0l8CtZhbo3qj2ZJTIDvVxFeeAUcMhAIgU4gJB6OmW6/VV8NJlArfeyaA+85/lV
 lVYD53qBcc0IydDWlRD5oU8T55pqv9hg0W9WkpWrtjxoTlxPX5rDj7yrKEqiQs1M
 IUa5X4Qnwo/C2ATD/t2G3PIM7OxdCJ7YjyrZ27VWWRsUJW8DOqXtJX6HBs+VT9cM
 mh/IeIy60rmKgf2Ag2ZJCvrKnmqXqJFyGjEDzk6gXoqktQyWnUBLhQoyLh5r9UlA
 4ThLGvPwUh5QEFOoo3cpN72X0wUeHcebfh4DgY/G3PeEK4J1CVq1UXLB1a8Si7X4
 qf5ZqfUU4dr6v8C2AIqd9S/H6wm8v84hzA2uXca9tsw67rAcLc6N0rHydlLtn+n8
 DL4PQYcUmn0LGrhIi2t/4ec80SGBf7ad/iDbr3A0K5NsV5kMl8dReg2yCDl9kHM0
 yHFk8zLTKh5fs7fmmJXOORP33YMzstET9L1oKBv9cd9iMlHNUn27o9tpwwa2noM+
 v6E+UCKlRTauj/MTxZITdmNzgGEymgu5bpbb77N24OTF9jf48OEW+cr0ZzgrVYtk
 wGuj9RFGcwneJoWjVPGURu1xBuC1AX9PbqnR9NQXbqmuwd6BINk=
 =pLW3
 -----END PGP SIGNATURE-----

Merge tag 'xfs-5.14-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Darrick Wong:
 "This contains a bunch of bug fixes in XFS.

  Dave and I have been busy the last couple of weeks to find and fix as
  many log recovery bugs as we can find; here are the results so far. Go
  fstests -g recoveryloop! ;)

   - Fix a number of coordination bugs relating to cache flushes for
     metadata writeback, cache flushes for multi-buffer log writes, and
     FUA writes for single-buffer log writes

   - Fix a bug with incorrect replay of attr3 blocks

   - Fix unnecessary stalls when flushing logs to disk

   - Fix spoofing problems when recovering realtime bitmap blocks"

* tag 'xfs-5.14-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: prevent spoofing of rtbitmap blocks when recovering buffers
  xfs: limit iclog tail updates
  xfs: need to see iclog flags in tracing
  xfs: Enforce attr3 buffer recovery order
  xfs: logging the on disk inode LSN can make it go backwards
  xfs: avoid unnecessary waits in xfs_log_force_lsn()
  xfs: log forces imply data device cache flushes
  xfs: factor out forced iclog flushes
  xfs: fix ordering violation between cache flushes and tail updates
  xfs: fold __xlog_state_release_iclog into xlog_state_release_iclog
  xfs: external logs need to flush data device
  xfs: flush data dev on external log write
2021-08-01 12:07:23 -07:00
Linus Torvalds
f3438b4c4e 3 cifs/smb3 fixes, including two for stable, and a fix for an fallocate problem noticed by Clang
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmEEaj8ACgkQiiy9cAdy
 T1FdSgv+KaGjM/IPPHDymSoWEtkPp3x3frPEXfJrTSD/M8NVx+9UEnwH8JEZEn+t
 PEQyGxOx28qo8ZtNJKmKAAkAm45/gTriteyjk8m2witJdDaK7wyhRYJ3Mmo99ot5
 hOV55Jn40N+dK4r2AfWv6WauiDnLw83QlSmm5eN1/om3FXFx49IOjeKGaagpHGSQ
 MaaEX0pMcyMop/a1TFOC01M8ergYfWF/3S3zmA5S9Ei4+PefxbhYpdj4RTRZY7rp
 +2ZNyAs4Zc88l6weF4XCNck/2qrW6A4a1DCmnD19MJTGK+4FqqIBZ0f5fWV6aFGl
 EIi6vsHnt4b2MGkePs/SrpifCMZk/uSM7p9TQvplX0KHhIJ5h55pO0+GFdH0uYlv
 KkV2SjepCavsn8pwhVyqMRQ8+/MCQi+vAAP9vmdD2+V0cps8VFBexVSfuWEdwSBd
 JuDWeVP1kgR4OQGrCp4r8wWhSCh+FOGsdrwTHTwPWL9GCoUuiTui6Z36JM5fT6Dg
 437Hg7RY
 =qJCo
 -----END PGP SIGNATURE-----

Merge tag '5.14-rc3-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
 "Three cifs/smb3 fixes, including two for stable, and a fix for an
  fallocate problem noticed by Clang"

* tag '5.14-rc3-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: add missing parsing of backupuid
  smb3: rc uninitialized in one fallocate path
  SMB3: fix readpage for large swap cache
2021-07-31 09:25:12 -07:00
Linus Torvalds
3a34b13a88 pipe: make pipe writes always wake up readers
Since commit 1b6b26ae70 ("pipe: fix and clarify pipe write wakeup
logic") we have sanitized the pipe write logic, and would only try to
wake up readers if they needed it.

In particular, if the pipe already had data in it before the write,
there was no point in trying to wake up a reader, since any existing
readers must have been aware of the pre-existing data already.  Doing
extraneous wakeups will only cause potential thundering herd problems.

However, it turns out that some Android libraries have misused the EPOLL
interface, and expected "edge triggered" be to "any new write will
trigger it".  Even if there was no edge in sight.

Quoting Sandeep Patil:
 "The commit 1b6b26ae70 ('pipe: fix and clarify pipe write wakeup
  logic') changed pipe write logic to wakeup readers only if the pipe
  was empty at the time of write. However, there are libraries that
  relied upon the older behavior for notification scheme similar to
  what's described in [1]

  One such library 'realm-core'[2] is used by numerous Android
  applications. The library uses a similar notification mechanism as GNU
  Make but it never drains the pipe until it is full. When Android moved
  to v5.10 kernel, all applications using this library stopped working.

  The library has since been fixed[3] but it will be a while before all
  applications incorporate the updated library"

Our regression rule for the kernel is that if applications break from
new behavior, it's a regression, even if it was because the application
did something patently wrong.  Also note the original report [4] by
Michal Kerrisk about a test for this epoll behavior - but at that point
we didn't know of any actual broken use case.

So add the extraneous wakeup, to approximate the old behavior.

[ I say "approximate", because the exact old behavior was to do a wakeup
  not for each write(), but for each pipe buffer chunk that was filled
  in. The behavior introduced by this change is not that - this is just
  "every write will cause a wakeup, whether necessary or not", which
  seems to be sufficient for the broken library use. ]

It's worth noting that this adds the extraneous wakeup only for the
write side, while the read side still considers the "edge" to be purely
about reading enough from the pipe to allow further writes.

See commit f467a6a664 ("pipe: fix and clarify pipe read wakeup logic")
for the pipe read case, which remains that "only wake up if the pipe was
full, and we read something from it".

Link: https://lore.kernel.org/lkml/CAHk-=wjeG0q1vgzu4iJhW5juPkTsjTYmiqiMUYAebWW+0bam6w@mail.gmail.com/ [1]
Link: https://github.com/realm/realm-core [2]
Link: https://github.com/realm/realm-core/issues/4666 [3]
Link: https://lore.kernel.org/lkml/CAKgNAkjMBGeAwF=2MKK758BhxvW58wYTgYKB2V-gY1PwXxrH+Q@mail.gmail.com/ [4]
Link: https://lore.kernel.org/lkml/20210729222635.2937453-1-sspatil@android.com/
Reported-by: Sandeep Patil <sspatil@android.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-30 15:42:34 -07:00
Linus Torvalds
4669e13cd6 block-5.14-2021-07-30
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmEEEz8QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpglXD/9CGREpOf1W5oqOScpTygjehwrRnAYisQv6
 Oca/qGHBa61BTN3taAJc4NMwl+IwFBER2kdTcOyz8hNmyAUPyRmFND0mG2vGTzQA
 P9+ekiRKCJ1aRLsnyBL0JBbmvdoPMBHz39P165vMWMrVmnlpcPKoYDS0itHtYYNP
 VD5Y3A9ACGMDglipDmL+3tsXQo/AoJqRO8WGMUBY2qJ0lasYuCbPpzq0kHzXi6kE
 0X64bg6JOZVd3wdyWywKahW3ntsVNLswRUBzLVrnjwE29UuBGWgF+/vwyW/Ob0yS
 ojafKvehCYnV8Q7IatASOtbwGLvLKgpJZXf7VUEsYnSD6SnmoZctjMjRdyLhNWut
 lD86Y+eWjQM0pUsOVPykfrV2hd9CrhjyRFskcbI0SJRlMOl0Lstl/X17efDWcDmz
 1/V8ub3gKA3HF2Gc/QKhPJDClxM7SaWnsAO3Rk+qJ6bT4EiiRg2GewI1C7YNpmGW
 ty1fqcQE36JtSWadH4KL/evmX258ROfn3QT1nut2jpNsd1RQ+hHBcjcfeOx6n1GX
 ALxT8LnmlVYbAUwQvXJcqFcft8K3JoB5ZXT74lat/CAbIKhfEUeSUiqnQcQ8kJLW
 MTKviuZ9eJHO6/E7vw08ARDR0PmpSFqvc6rK9DiIM/kmVDz8OdLMovTqzX/hIzUT
 7IfyHzQbwg==
 =5FG2
 -----END PGP SIGNATURE-----

Merge tag 'block-5.14-2021-07-30' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - gendisk freeing fix (Christoph)

 - blk-iocost wake ordering fix (Tejun)

 - tag allocation error handling fix (John)

 - loop locking fix. While this isn't the prettiest fix in the world,
   nobody has any good alternatives for 5.14. Something to likely
   revisit for 5.15. (Tetsuo)

* tag 'block-5.14-2021-07-30' of git://git.kernel.dk/linux-block:
  block: delay freeing the gendisk
  blk-iocost: fix operation ordering in iocg_wake_fn()
  blk-mq-sched: Fix blk_mq_sched_alloc_tags() error handling
  loop: reintroduce global lock for safe loop_validate_file() traversal
2021-07-30 11:08:12 -07:00
Linus Torvalds
27eb687bcd io_uring-5.14-2021-07-30
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmEEE1IQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpm3jD/9Tx+VeLEmfapYdZXziDWrVecwm0iWfSoT8
 ibKJGIYHmCH1XxIVT16+a1HcJAra7NGS60zW6JAvBh5ZWL/smEriOu+R6Twa1ez2
 9Gby+39+V0PP3x9sKtRp7TsmsA0paIqVG4zwfyaUCyvYfiSTURoYde4lZwSGEMhb
 8b0FPju/hmN/iyRGtu1eQvTbp252vahkGE8PKYkZWxNkTdJpvRax3kmbjH3A8/X3
 rr0KDMgk4ePVap+3i/h94rXweaLCq9KiSm95Zvs63me6J2CbpKz/hGtvR5TiENiI
 0mEqkA9PVE6LRbF4T6gk2gGYrkfEiAzca0r8BdoY0TxuEb2SfK5P3JOfYxS4VHek
 lVc+lvm0YglRn6dWxdUMhXYTrlAx4cRnIM9Oqg/WzN4WCZREouE2J/MXHNy3lNfQ
 zQQkmbNzeLQHaiq/JWAtu39LAeHWwEC/FernVV1i0wYOT6EacpVNM6OLUmHQGzkZ
 mnQpc9AzCgLifNZb4DDlP0MyM1D+GXGm5tdozmUILFQoAnvA6+3EtnRxOH9cWLaa
 mSNenp5kag/nbdkFTo3X+ptGYgLBWEluT/dKMsoqulPu+ZCV4zh9rAIgWqQUYwa2
 5z/d6OAr3V8hjChiF4a7JrRUISu0f0Eh/GRCqTSp97Hys5rJcDxGB9WQKv7u+sTc
 BArhbyoHcg==
 =pzj2
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.14-2021-07-30' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:

 - A fix for block backed reissue (me)

 - Reissue context hardening (me)

 - Async link locking fix (Pavel)

* tag 'io_uring-5.14-2021-07-30' of git://git.kernel.dk/linux-block:
  io_uring: fix poll requests leaking second poll entries
  io_uring: don't block level reissue off completion path
  io_uring: always reissue from task_work context
  io_uring: fix race in unified task_work running
  io_uring: fix io_prep_async_link locking
2021-07-30 11:01:47 -07:00
Linus Torvalds
051df241e4 for-5.14-rc3-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmEEDKIACgkQxWXV+ddt
 WDtW+BAAnUD7h3ollIQo4C6hE9WaTG49Tp12Z00Og2m8hn4XyhI2QIaDz6a2CU7n
 MLQv16vZUQk5Z/VMtczM+5ZF5Rf0ywlMXnS4Sq5yKWT0YHpnH7q2nMAvg4gql/tJ
 Ldov92hnTrFAZX6vvkLVM5lZriY7fop3Lv2vHeAKu4CymAoisAv+SLa5xYkBR6Ig
 3S16+lh/rIRgssI7KuDnjp9iTXvnB1J2MbfAOLNfqjXGWUDumu1k7HWQSNYZnHJX
 L390/QS3F3K6Trxkf5MSUXOxQROqcGKQVKyAR5ZvyULKly84nDpiINze80yCopq/
 7//32pO43xDPb78c7saxSWtjdgX4XsBOdzIoiJZHnc5CTTbCcneLes8zz4fD6AGq
 vjZKDLTgiO/sRlkQHZQk1y+7CawrqbKkAG+O7MqF7KGOtQ1WLRGfAkFP732TBFXM
 TyoZ7ENh3TiFDdeRmkOonpQ2k3DctW+7z2BmdlsuSXgD8fFbEArfxnO1SnRHrmcr
 C8FNeSkks8MTL7uePNUxwlnB8uHuGWCgSuS++q4OkCnzA3AmO6cRlDoMT3RMwVB/
 wQxvqF/U6JJx16YOVqwA6ZjuUWVwyBj/WBKlaxgfghz8CUmDC0D4Xb2/S1UVcZi6
 bFRph0UKeE5LaduoNZYaAqMOinCXFmetjudPmWO4sWfPrLb1mOY=
 =J0Pw
 -----END PGP SIGNATURE-----

Merge tag 'for-5.14-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - fix -Warray-bounds warning, to help external patchset to make it
   default treewide

 - fix writeable device accounting (syzbot report)

 - fix fsync and log replay after a rename and inode eviction

 - fix potentially lost error code when submitting multiple bios for
   compressed range

* tag 'for-5.14-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: calculate number of eb pages properly in csum_tree_block
  btrfs: fix rw device counting in __btrfs_free_extra_devids
  btrfs: fix lost inode on log replay after mix of fsync, rename and inode eviction
  btrfs: mark compressed range uptodate only if all bio succeed
2021-07-30 10:50:09 -07:00
Linus Torvalds
ad6ec09d96 Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "7 patches.

  Subsystems affected by this patch series: lib, ocfs2, and mm (slub,
  migration, and memcg)"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  mm/memcg: fix NULL pointer dereference in memcg_slab_free_hook()
  slub: fix unreclaimable slab stat for bulk free
  mm/migrate: fix NR_ISOLATED corruption on 64-bit
  mm: memcontrol: fix blocking rstat function called from atomic cgroup1 thresholding code
  ocfs2: issue zeroout to EOF blocks
  ocfs2: fix zero out valid data
  lib/test_string.c: move string selftest in the Runtime Testing menu
2021-07-30 10:29:58 -07:00
Junxiao Bi
9449ad33be ocfs2: issue zeroout to EOF blocks
For punch holes in EOF blocks, fallocate used buffer write to zero the
EOF blocks in last cluster.  But since ->writepage will ignore EOF
pages, those zeros will not be flushed.

This "looks" ok as commit 6bba4471f0 ("ocfs2: fix data corruption by
fallocate") will zero the EOF blocks when extend the file size, but it
isn't.  The problem happened on those EOF pages, before writeback, those
pages had DIRTY flag set and all buffer_head in them also had DIRTY flag
set, when writeback run by write_cache_pages(), DIRTY flag on the page
was cleared, but DIRTY flag on the buffer_head not.

When next write happened to those EOF pages, since buffer_head already
had DIRTY flag set, it would not mark page DIRTY again.  That made
writeback ignore them forever.  That will cause data corruption.  Even
directio write can't work because it will fail when trying to drop pages
caches before direct io, as it found the buffer_head for those pages
still had DIRTY flag set, then it will fall back to buffer io mode.

To make a summary of the issue, as writeback ingores EOF pages, once any
EOF page is generated, any write to it will only go to the page cache,
it will never be flushed to disk even file size extends and that page is
not EOF page any more.  The fix is to avoid zero EOF blocks with buffer
write.

The following code snippet from qemu-img could trigger the corruption.

  656   open("6b3711ae-3306-4bdd-823c-cf1c0060a095.conv.2", O_RDWR|O_DIRECT|O_CLOEXEC) = 11
  ...
  660   fallocate(11, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE, 2275868672, 327680 <unfinished ...>
  660   fallocate(11, 0, 2275868672, 327680) = 0
  658   pwrite64(11, "

Link: https://lkml.kernel.org/r/20210722054923.24389-2-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-30 10:14:39 -07:00
Junxiao Bi
f267aeb6de ocfs2: fix zero out valid data
If append-dio feature is enabled, direct-io write and fallocate could
run in parallel to extend file size, fallocate used "orig_isize" to
record i_size before taking "ip_alloc_sem", when
ocfs2_zeroout_partial_cluster() zeroout EOF blocks, i_size maybe already
extended by ocfs2_dio_end_io_write(), that will cause valid data zeroed
out.

Link: https://lkml.kernel.org/r/20210722054923.24389-1-junxiao.bi@oracle.com
Fixes: 6bba4471f0 ("ocfs2: fix data corruption by fallocate")
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-30 10:14:39 -07:00
Linus Torvalds
cade08a572 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mattst88/alpha
Pull alpha updates from Matt Turner:
 "They're mostly small janitorial fixes but there's also more important
  ones:

   - drop the alpha-specific x86 binary loader (David Hildenbrand)

   - regression fix for at least Marvel platforms (Mike Rapoport)

   - fix for a scary-looking typo (Zheng Yongjun)"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mattst88/alpha:
  alpha: register early reserved memory in memblock
  alpha: fix spelling mistakes
  alpha: Remove space between * and parameter name
  alpha: fp_emul: avoid init/cleanup_module names
  alpha: Add syscall_get_return_value()
  binfmt: remove support for em86 (alpha only)
  alpha: fix typos in a comment
  alpha: defconfig: add necessary configs for boot testing
  alpha: Send stop IPI to send to online CPUs
  alpha: convert comma to semicolon
  alpha: remove undef inline in compiler.h
  alpha: Kconfig: Replace HTTP links with HTTPS ones
  alpha: __udiv_qrnnd should be exported
2021-07-29 20:57:56 -07:00
Darrick J. Wong
81a448d7b0 xfs: prevent spoofing of rtbitmap blocks when recovering buffers
While reviewing the buffer item recovery code, the thought occurred to
me: in V5 filesystems we use log sequence number (LSN) tracking to avoid
replaying older metadata updates against newer log items.  However, we
use the magic number of the ondisk buffer to find the LSN of the ondisk
metadata, which means that if an attacker can control the layout of the
realtime device precisely enough that the start of an rt bitmap block
matches the magic and UUID of some other kind of block, they can control
the purported LSN of that spoofed block and thereby break log replay.

Since realtime bitmap and summary blocks don't have headers at all, we
have no way to tell if a block really should be replayed.  The best we
can do is replay unconditionally and hope for the best.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2021-07-29 09:27:29 -07:00
Dave Chinner
9d11001420 xfs: limit iclog tail updates
From the department of "generic/482 keeps on giving", we bring you
another tail update race condition:

iclog:
	S1			C1
	+-----------------------+-----------------------+
				 S2			EOIC

Two checkpoints in a single iclog. One is complete, the other just
contains the start record and overruns into a new iclog.

Timeline:

Before S1:	Cache flush, log tail = X
At S1:		Metadata stable, write start record and checkpoint
At C1:		Write commit record, set NEED_FUA
		Single iclog checkpoint, so no need for NEED_FLUSH
		Log tail still = X, so no need for NEED_FLUSH

After C1,
Before S2:	Cache flush, log tail = X
At S2:		Metadata stable, write start record and checkpoint
After S2:	Log tail moves to X+1
At EOIC:	End of iclog, more journal data to write
		Releases iclog
		Not a commit iclog, so no need for NEED_FLUSH
		Writes log tail X+1 into iclog.

At this point, the iclog has tail X+1 and NEED_FUA set. There has
been no cache flush for the metadata between X and X+1, and the
iclog writes the new tail permanently to the log. THis is sufficient
to violate on disk metadata/journal ordering.

We have two options here. The first is to detect this case in some
manner and ensure that the partial checkpoint write sets NEED_FLUSH
when the iclog is already marked NEED_FUA and the log tail changes.
This seems somewhat fragile and quite complex to get right, and it
doesn't actually make it obvious what underlying problem it is
actually addressing from reading the code.

The second option seems much cleaner to me, because it is derived
directly from the requirements of the C1 commit record in the iclog.
That is, when we write this commit record to the iclog, we've
guaranteed that the metadata/data ordering is correct for tail
update purposes. Hence if we only write the log tail into the iclog
for the *first* commit record rather than the log tail at the last
release, we guarantee that the log tail does not move past where the
the first commit record in the log expects it to be.

IOWs, taking the first option means that replay of C1 becomes
dependent on future operations doing the right thing, not just the
C1 checkpoint itself doing the right thing. This makes log recovery
almost impossible to reason about because now we have to take into
account what might or might not have happened in the future when
looking at checkpoints in the log rather than just having to
reconstruct the past...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-07-29 09:27:29 -07:00
Dave Chinner
b2ae3a9ef9 xfs: need to see iclog flags in tracing
Because I cannot tell if the NEED_FLUSH flag is being set correctly
by the log force and CIL push machinery without it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-07-29 09:27:29 -07:00
Dave Chinner
d8f4c2d039 xfs: Enforce attr3 buffer recovery order
From the department of "WTAF? How did we miss that!?"...

When we are recovering a buffer, the first thing we do is check the
buffer magic number and extract the LSN from the buffer. If the LSN
is older than the current LSN, we replay the modification to it. If
the metadata on disk is newer than the transaction in the log, we
skip it. This is a fundamental v5 filesystem metadata recovery
behaviour.

generic/482 failed with an attribute writeback failure during log
recovery. The write verifier caught the corruption before it got
written to disk, and the attr buffer dump looked like:

XFS (dm-3): Metadata corruption detected at xfs_attr3_leaf_verify+0x275/0x2e0, xfs_attr3_leaf block 0x19be8
XFS (dm-3): Unmount and run xfs_repair
XFS (dm-3): First 128 bytes of corrupted metadata buffer:
00000000: 00 00 00 00 00 00 00 00 3b ee 00 00 4d 2a 01 e1  ........;...M*..
00000010: 00 00 00 00 00 01 9b e8 00 00 00 01 00 00 05 38  ...............8
                                  ^^^^^^^^^^^^^^^^^^^^^^^
00000020: df 39 5e 51 58 ac 44 b6 8d c5 e7 10 44 09 bc 17  .9^QX.D.....D...
00000030: 00 00 00 00 00 02 00 83 00 03 00 cc 0f 24 01 00  .............$..
00000040: 00 68 0e bc 0f c8 00 10 00 00 00 00 00 00 00 00  .h..............
00000050: 00 00 3c 31 0f 24 01 00 00 00 3c 32 0f 88 01 00  ..<1.$....<2....
00000060: 00 00 3c 33 0f d8 01 00 00 00 00 00 00 00 00 00  ..<3............
00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
.....

The highlighted bytes are the LSN that was replayed into the
buffer: 0x100000538. This is cycle 1, block 0x538. Prior to replay,
that block on disk looks like this:

$ sudo xfs_db -c "fsb 0x417d" -c "type attr3" -c p /dev/mapper/thin-vol
hdr.info.hdr.forw = 0
hdr.info.hdr.back = 0
hdr.info.hdr.magic = 0x3bee
hdr.info.crc = 0xb5af0bc6 (correct)
hdr.info.bno = 105448
hdr.info.lsn = 0x100000900
               ^^^^^^^^^^^
hdr.info.uuid = df395e51-58ac-44b6-8dc5-e7104409bc17
hdr.info.owner = 131203
hdr.count = 2
hdr.usedbytes = 120
hdr.firstused = 3796
hdr.holes = 1
hdr.freemap[0-2] = [base,size]

Note the LSN stamped into the buffer on disk: 1/0x900. The version
on disk is much newer than the log transaction that was being
replayed. That's a bug, and should -never- happen.

So I immediately went to look at xlog_recover_get_buf_lsn() to check
that we handled the LSN correctly. I was wondering if there was a
similar "two commits with the same start LSN skips the second
replay" problem with buffers. I didn't get that far, because I found
a much more basic, rudimentary bug: xlog_recover_get_buf_lsn()
doesn't recognise buffers with XFS_ATTR3_LEAF_MAGIC set in them!!!

IOWs, attr3 leaf buffers fall through the magic number checks
unrecognised, so trigger the "recover immediately" behaviour instead
of undergoing an LSN check. IOWs, we incorrectly replay ATTR3 leaf
buffers and that causes silent on disk corruption of inode attribute
forks and potentially other things....

Git history shows this is *another* zero day bug, this time
introduced in commit 50d5c8d8e9 ("xfs: check LSN ordering for v5
superblocks during recovery") which failed to handle the attr3 leaf
buffers in recovery. And we've failed to handle them ever since...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-07-29 09:27:29 -07:00
Dave Chinner
32baa63d82 xfs: logging the on disk inode LSN can make it go backwards
When we log an inode, we format the "log inode" core and set an LSN
in that inode core. We do that via xfs_inode_item_format_core(),
which calls:

	xfs_inode_to_log_dinode(ip, dic, ip->i_itemp->ili_item.li_lsn);

to format the log inode. It writes the LSN from the inode item into
the log inode, and if recovery decides the inode item needs to be
replayed, it recovers the log inode LSN field and writes it into the
on disk inode LSN field.

Now this might seem like a reasonable thing to do, but it is wrong
on multiple levels. Firstly, if the item is not yet in the AIL,
item->li_lsn is zero. i.e. the first time the inode it is logged and
formatted, the LSN we write into the log inode will be zero. If we
only log it once, recovery will run and can write this zero LSN into
the inode.

This means that the next time the inode is logged and log recovery
runs, it will *always* replay changes to the inode regardless of
whether the inode is newer on disk than the version in the log and
that violates the entire purpose of recording the LSN in the inode
at writeback time (i.e. to stop it going backwards in time on disk
during recovery).

Secondly, if we commit the CIL to the journal so the inode item
moves to the AIL, and then relog the inode, the LSN that gets
stamped into the log inode will be the LSN of the inode's current
location in the AIL, not it's age on disk. And it's not the LSN that
will be associated with the current change. That means when log
recovery replays this inode item, the LSN that ends up on disk is
the LSN for the previous changes in the log, not the current
changes being replayed. IOWs, after recovery the LSN on disk is not
in sync with the LSN of the modifications that were replayed into
the inode. This, again, violates the recovery ordering semantics
that on-disk writeback LSNs provide.

Hence the inode LSN in the log dinode is -always- invalid.

Thirdly, recovery actually has the LSN of the log transaction it is
replaying right at hand - it uses it to determine if it should
replay the inode by comparing it to the on-disk inode's LSN. But it
doesn't use that LSN to stamp the LSN into the inode which will be
written back when the transaction is fully replayed. It uses the one
in the log dinode, which we know is always going to be incorrect.

Looking back at the change history, the inode logging was broken by
commit 93f958f9c4 ("xfs: cull unnecessary icdinode fields") way
back in 2016 by a stupid idiot who thought he knew how this code
worked. i.e. me. That commit replaced an in memory di_lsn field that
was updated only at inode writeback time from the inode item.li_lsn
value - and hence always contained the same LSN that appeared in the
on-disk inode - with a read of the inode item LSN at inode format
time. CLearly these are not the same thing.

Before 93f958f9c4, the log recovery behaviour was irrelevant,
because the LSN in the log inode always matched the on-disk LSN at
the time the inode was logged, hence recovery of the transaction
would never make the on-disk LSN in the inode go backwards or get
out of sync.

A symptom of the problem is this, caught from a failure of
generic/482. Before log recovery, the inode has been allocated but
never used:

xfs_db> inode 393388
xfs_db> p
core.magic = 0x494e
core.mode = 0
....
v3.crc = 0x99126961 (correct)
v3.change_count = 0
v3.lsn = 0
v3.flags2 = 0
v3.cowextsize = 0
v3.crtime.sec = Thu Jan  1 10:00:00 1970
v3.crtime.nsec = 0

After log recovery:

xfs_db> p
core.magic = 0x494e
core.mode = 020444
....
v3.crc = 0x23e68f23 (correct)
v3.change_count = 2
v3.lsn = 0
v3.flags2 = 0
v3.cowextsize = 0
v3.crtime.sec = Thu Jul 22 17:03:03 2021
v3.crtime.nsec = 751000000
...

You can see that the LSN of the on-disk inode is 0, even though it
clearly has been written to disk. I point out this inode, because
the generic/482 failure occurred because several adjacent inodes in
this specific inode cluster were not replayed correctly and still
appeared to be zero on disk when all the other metadata (inobt,
finobt, directories, etc) indicated they should be allocated and
written back.

The fix for this is two-fold. The first is that we need to either
revert the LSN changes in 93f958f9c4 or stop logging the inode LSN
altogether. If we do the former, log recovery does not need to
change but we add 8 bytes of memory per inode to store what is
largely a write-only inode field. If we do the latter, log recovery
needs to stamp the on-disk inode in the same manner that inode
writeback does.

I prefer the latter, because we shouldn't really be trying to log
and replay changes to the on disk LSN as the on-disk value is the
canonical source of the on-disk version of the inode. It also
matches the way we recover buffer items - we create a buf_log_item
that carries the current recovery transaction LSN that gets stamped
into the buffer by the write verifier when it gets written back
when the transaction is fully recovered.

However, this might break log recovery on older kernels even more,
so I'm going to simply ignore the logged value in recovery and stamp
the on-disk inode with the LSN of the transaction being recovered
that will trigger writeback on transaction recovery completion. This
will ensure that the on-disk inode LSN always reflects the LSN of
the last change that was written to disk, regardless of whether it
comes from log recovery or runtime writeback.

Fixes: 93f958f9c4 ("xfs: cull unnecessary icdinode fields")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-07-29 09:27:29 -07:00
Dave Chinner
8191d8222c xfs: avoid unnecessary waits in xfs_log_force_lsn()
Before waiting on a iclog in xfs_log_force_lsn(), we don't check to
see if the iclog has already been completed and the contents on
stable storage. We check for completed iclogs in xfs_log_force(), so
we should do the same thing for xfs_log_force_lsn().

This fixed some random up-to-30s pauses seen in unmounting
filesystems in some tests. A log force ends up waiting on completed
iclog, and that doesn't then get flushed (and hence the log force
get completed) until the background log worker issues a log force
that flushes the iclog in question. Then the unmount unblocks and
continues.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-07-29 09:27:28 -07:00
Dave Chinner
2bf1ec0ff0 xfs: log forces imply data device cache flushes
After fixing the tail_lsn vs cache flush race, generic/482 continued
to fail in a similar way where cache flushes were missing before
iclog FUA writes. Tracing of iclog state changes during the fsstress
workload portion of the test (via xlog_iclog* events) indicated that
iclog writes were coming from two sources - CIL pushes and log
forces (due to fsync/O_SYNC operations). All of the cases where a
recovery problem was triggered indicated that the log force was the
source of the iclog write that was not preceeded by a cache flush.

This was an oversight in the modifications made in commit
eef983ffea ("xfs: journal IO cache flush reductions"). Log forces
for fsync imply a data device cache flush has been issued if an
iclog was flushed to disk and is indicated to the caller via the
log_flushed parameter so they can elide the device cache flush if
the journal issued one.

The change in eef983ffea results in iclogs only issuing a cache
flush if XLOG_ICL_NEED_FLUSH is set on the iclog, but this was not
added to the iclogs that the log force code flushes to disk. Hence
log forces are no longer guaranteeing that a cache flush is issued,
hence opening up a potential on-disk ordering failure.

Log forces should also set XLOG_ICL_NEED_FUA as well to ensure that
the actual iclogs it forces to the journal are also on stable
storage before it returns to the caller.

This patch introduces the xlog_force_iclog() helper function to
encapsulate the process of taking a reference to an iclog, switching
its state if WANT_SYNC and flushing it to stable storage correctly.

Both xfs_log_force() and xfs_log_force_lsn() are converted to use
it, as is xlog_unmount_write() which has an elaborate method of
doing exactly the same "write this iclog to stable storage"
operation.

Further, if the log force code needs to wait on a iclog in the
WANT_SYNC state, it needs to ensure that iclog also results in a
cache flush being issued. This covers the case where the iclog
contains the commit record of the CIL flush that the log force
triggered, but it hasn't been written yet because there is still an
active reference to the iclog.

Note: this whole cache flush whack-a-mole patch is a result of log
forces still being iclog state centric rather than being CIL
sequence centric. Most of this nasty code will go away in future
when log forces are converted to wait on CIL sequence push
completion rather than iclog completion. With the CIL push algorithm
guaranteeing that the CIL checkpoint is fully on stable storage when
it completes, we no longer need to iterate iclogs and push them to
ensure a CIL sequence push has completed and so all this nasty iclog
iteration and flushing code will go away.

Fixes: eef983ffea ("xfs: journal IO cache flush reductions")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-07-29 09:27:28 -07:00
Dave Chinner
45eddb4140 xfs: factor out forced iclog flushes
We force iclogs in several places - we need them all to have the
same cache flush semantics, so start by factoring out the iclog
force into a common helper.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-07-29 09:27:28 -07:00
Dave Chinner
0dc8f7f139 xfs: fix ordering violation between cache flushes and tail updates
There is a race between the new CIL async data device metadata IO
completion cache flush and the log tail in the iclog the flush
covers being updated. This can be seen by repeating generic/482 in a
loop and eventually log recovery fails with a failures such as this:

XFS (dm-3): Starting recovery (logdev: internal)
XFS (dm-3): bad inode magic/vsn daddr 228352 #0 (magic=0)
XFS (dm-3): Metadata corruption detected at xfs_inode_buf_verify+0x180/0x190, xfs_inode block 0x37c00 xfs_inode_buf_verify
XFS (dm-3): Unmount and run xfs_repair
XFS (dm-3): First 128 bytes of corrupted metadata buffer:
00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
XFS (dm-3): metadata I/O error in "xlog_recover_items_pass2+0x55/0xc0" at daddr 0x37c00 len 32 error 117

Analysis of the logwrite replay shows that there were no writes to
the data device between the FUA @ write 124 and the FUA at write @
125, but log recovery @ 125 failed. The difference was the one log
write @ 125 moved the tail of the log forwards from (1,8) to (1,32)
and so the inode create intent in (1,8) was not replayed and so the
inode cluster was zero on disk when replay of the first inode item
in (1,32) was attempted.

What this meant was that the journal write that occurred at @ 125
did not ensure that metadata completed before the iclog was written
was correctly on stable storage. The tail of the log moved forward,
so IO must have been completed between the two iclog writes. This
means that there is a race condition between the unconditional async
cache flush in the CIL push work and the tail LSN that is written to
the iclog. This happens like so:

CIL push work				AIL push work
-------------				-------------
Add to committing list
start async data dev cache flush
.....
<flush completes>
<all writes to old tail lsn are stable>
xlog_write
  ....					push inode create buffer
					<start IO>
					.....
xlog_write(commit record)
  ....					<IO completes>
  					log tail moves
  					  xlog_assign_tail_lsn()
start_lsn == commit_lsn
  <no iclog preflush!>
xlog_state_release_iclog
  __xlog_state_release_iclog()
    <writes *new* tail_lsn into iclog>
  xlog_sync()
    ....
    submit_bio()
<tail in log moves forward without flushing written metadata>

Essentially, this can only occur if the commit iclog is issued
without a cache flush. If the iclog bio is submitted with
REQ_PREFLUSH, then it will guarantee that all the completed IO is
one stable storage before the iclog bio with the new tail LSN in it
is written to the log.

IOWs, the tail lsn that is written to the iclog needs to be sampled
*before* we issue the cache flush that guarantees all IO up to that
LSN has been completed.

To fix this without giving up the performance advantage of the
flush/FUA optimisations (e.g. g/482 runtime halves with 5.14-rc1
compared to 5.13), we need to ensure that we always issue a cache
flush if the tail LSN changes between the initial async flush and
the commit record being written. THis requires sampling the tail_lsn
before we start the flush, and then passing the sampled tail LSN to
xlog_state_release_iclog() so it can determine if the the tail LSN
has changed while writing the checkpoint. If the tail LSN has
changed, then it needs to set the NEED_FLUSH flag on the iclog and
we'll issue another cache flush before writing the iclog.

Fixes: eef983ffea ("xfs: journal IO cache flush reductions")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-07-29 09:27:28 -07:00
Dave Chinner
9d39206440 xfs: fold __xlog_state_release_iclog into xlog_state_release_iclog
Fold __xlog_state_release_iclog into its only caller to prepare
make an upcoming fix easier.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
[hch: split from a larger patch]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-07-29 09:27:28 -07:00
Dave Chinner
b5d721eaae xfs: external logs need to flush data device
The recent journal flush/FUA changes replaced the flushing of the
data device on every iclog write with an up-front async data device
cache flush. Unfortunately, the assumption of which this was based
on has been proven incorrect by the flush vs log tail update
ordering issue. As the fix for that issue uses the
XLOG_ICL_NEED_FLUSH flag to indicate that data device needs a cache
flush, we now need to (once again) ensure that an iclog write to
external logs that need a cache flush to be issued actually issue a
cache flush to the data device as well as the log device.

Fixes: eef983ffea ("xfs: journal IO cache flush reductions")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-07-29 09:27:27 -07:00