linux-stable/include/linux/rcupdate_trace.h
Paul E. McKenney ba3a86e472 rcu-tasks: Fix grace-period/unlock race in RCU Tasks Trace
The more intense grace-period processing resulting from the 50x RCU
Tasks Trace grace-period speedups exposed the following race condition:

o	Task A running on CPU 0 executes rcu_read_lock_trace(),
	entering a read-side critical section.

o	When Task A eventually invokes rcu_read_unlock_trace()
	to exit its read-side critical section, this function
	notes that the ->trc_reader_special.s flag is zero and
	and therefore invoke wil set ->trc_reader_nesting to zero
	using WRITE_ONCE().  But before that happens...

o	The RCU Tasks Trace grace-period kthread running on some other
	CPU interrogates Task A, but this fails because this task is
	currently running.  This kthread therefore sends an IPI to CPU 0.

o	CPU 0 receives the IPI, and thus invokes trc_read_check_handler().
	Because Task A has not yet cleared its ->trc_reader_nesting
	counter, this function sees that Task A is still within its
	read-side critical section.  This function therefore sets the
	->trc_reader_nesting.b.need_qs flag, AKA the .need_qs flag.

	Except that Task A has already checked the .need_qs flag, which
	is part of the ->trc_reader_special.s flag.  The .need_qs flag
	therefore remains set until Task A's next rcu_read_unlock_trace().

o	Task A now invokes synchronize_rcu_tasks_trace(), which cannot
	start a new grace period until the current grace period completes.
	And thus cannot return until after that time.

	But Task A's .need_qs flag is still set, which prevents the current
	grace period from completing.  And because Task A is blocked, it
	will never execute rcu_read_unlock_trace() until its call to
	synchronize_rcu_tasks_trace() returns.

	We are therefore deadlocked.

This race is improbable, but 80 hours of rcutorture made it happen twice.
The race was possible before the grace-period speedup, but roughly 50x
less probable.  Several thousand hours of rcutorture would have been
necessary to have a reasonable chance of making this happen before this
50x speedup.

This commit therefore eliminates this deadlock by setting
->trc_reader_nesting to a large negative number before checking the
.need_qs and zeroing (or decrementing with respect to its initial
value) ->trc_reader_nesting.  For its part, the IPI handler's
trc_read_check_handler() function adds a check for negative values,
deferring evaluation of the task in this case.  Taken together, these
changes avoid this deadlock scenario.

Fixes: 276c410448 ("rcu-tasks: Split ->trc_reader_need_end")
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: <bpf@vger.kernel.org>
Cc: <stable@vger.kernel.org> # 5.7.x
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-09-16 16:32:37 -07:00

92 lines
2.7 KiB
C

/* SPDX-License-Identifier: GPL-2.0+ */
/*
* Read-Copy Update mechanism for mutual exclusion, adapted for tracing.
*
* Copyright (C) 2020 Paul E. McKenney.
*/
#ifndef __LINUX_RCUPDATE_TRACE_H
#define __LINUX_RCUPDATE_TRACE_H
#include <linux/sched.h>
#include <linux/rcupdate.h>
#ifdef CONFIG_DEBUG_LOCK_ALLOC
extern struct lockdep_map rcu_trace_lock_map;
static inline int rcu_read_lock_trace_held(void)
{
return lock_is_held(&rcu_trace_lock_map);
}
#else /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
static inline int rcu_read_lock_trace_held(void)
{
return 1;
}
#endif /* #else #ifdef CONFIG_DEBUG_LOCK_ALLOC */
#ifdef CONFIG_TASKS_TRACE_RCU
void rcu_read_unlock_trace_special(struct task_struct *t, int nesting);
/**
* rcu_read_lock_trace - mark beginning of RCU-trace read-side critical section
*
* When synchronize_rcu_tasks_trace() is invoked by one task, then that
* task is guaranteed to block until all other tasks exit their read-side
* critical sections. Similarly, if call_rcu_trace() is invoked on one
* task while other tasks are within RCU read-side critical sections,
* invocation of the corresponding RCU callback is deferred until after
* the all the other tasks exit their critical sections.
*
* For more details, please see the documentation for rcu_read_lock().
*/
static inline void rcu_read_lock_trace(void)
{
struct task_struct *t = current;
WRITE_ONCE(t->trc_reader_nesting, READ_ONCE(t->trc_reader_nesting) + 1);
barrier();
if (IS_ENABLED(CONFIG_TASKS_TRACE_RCU_READ_MB) &&
t->trc_reader_special.b.need_mb)
smp_mb(); // Pairs with update-side barriers
rcu_lock_acquire(&rcu_trace_lock_map);
}
/**
* rcu_read_unlock_trace - mark end of RCU-trace read-side critical section
*
* Pairs with a preceding call to rcu_read_lock_trace(), and nesting is
* allowed. Invoking a rcu_read_unlock_trace() when there is no matching
* rcu_read_lock_trace() is verboten, and will result in lockdep complaints.
*
* For more details, please see the documentation for rcu_read_unlock().
*/
static inline void rcu_read_unlock_trace(void)
{
int nesting;
struct task_struct *t = current;
rcu_lock_release(&rcu_trace_lock_map);
nesting = READ_ONCE(t->trc_reader_nesting) - 1;
barrier(); // Critical section before disabling.
// Disable IPI-based setting of .need_qs.
WRITE_ONCE(t->trc_reader_nesting, INT_MIN);
if (likely(!READ_ONCE(t->trc_reader_special.s)) || nesting) {
WRITE_ONCE(t->trc_reader_nesting, nesting);
return; // We assume shallow reader nesting.
}
rcu_read_unlock_trace_special(t, nesting);
}
void call_rcu_tasks_trace(struct rcu_head *rhp, rcu_callback_t func);
void synchronize_rcu_tasks_trace(void);
void rcu_barrier_tasks_trace(void);
#endif /* #ifdef CONFIG_TASKS_TRACE_RCU */
#endif /* __LINUX_RCUPDATE_TRACE_H */