futex: Prevent the reuse of stale pi_state

Jiri Slaby reported a futex state inconsistency resulting in -EINVAL during
a lock operation for a PI futex. It requires that the a lock process is
interrupted by a timeout or signal:

  T1 Owns the futex in user space.

  T2 Tries to acquire the futex in kernel (futex_lock_pi()). Allocates a
     pi_state and attaches itself to it.

  T2 Times out and removes its rt_waiter from the rt_mutex. Drops the
     rtmutex lock and tries to acquire the hash bucket lock to remove
     the futex_q. The lock is contended and T2 schedules out.

  T1 Unlocks the futex (futex_unlock_pi()). Finds a futex_q but no
     rt_waiter. Unlocks the futex (do_uncontended) and makes it available
     to user space.

  T3 Acquires the futex in user space.

  T4 Tries to acquire the futex in kernel (futex_lock_pi()). Finds the
     existing futex_q of T2 and tries to attach itself to the existing
     pi_state.  This (attach_to_pi_state()) fails with -EINVAL because uval
     contains the TID of T3 but pi_state points to T1.

It's incorrect to unlock the futex and make it available for user space to
acquire as long as there is still an existing state attached to it in the
kernel.

T1 cannot hand over the futex to T2 because T2 already gave up and started
to clean up and is blocked on the hash bucket lock, so T2's futex_q with
the pi_state pointing to T1 is still queued.

T2 observes the futex_q, but ignores it as there is no waiter on the
corresponding rt_mutex and takes the uncontended path which allows the
subsequent caller of futex_lock_pi() (T4) to observe that stale state.

To prevent this the unlock path must dequeue all futex_q entries which
point to the same pi_state when there is no waiter on the rt mutex. This
requires obviously to make the dequeue conditional in the locking path to
prevent a double dequeue. With that it's guaranteed that user space cannot
observe an uncontended futex which has kernel state attached.

Fixes: fbeb558b0d ("futex/pi: Fix recursive rt_mutex waiter state")
Reported-by: Jiri Slaby <jirislaby@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jiri Slaby <jirislaby@kernel.org>
Link: https://lore.kernel.org/r/20240118115451.0TkD_ZhB@linutronix.de
Closes: https://lore.kernel.org/all/4611bcf2-44d0-4c34-9b84-17406f881003@kernel.org
This commit is contained in:
Sebastian Andrzej Siewior 2024-01-18 12:54:51 +01:00 committed by Thomas Gleixner
parent 296455ade1
commit e626cb02ee
2 changed files with 20 additions and 6 deletions

View File

@ -627,12 +627,21 @@ retry:
}
/*
* PI futexes can not be requeued and must remove themselves from the
* hash bucket. The hash bucket lock (i.e. lock_ptr) is held.
* PI futexes can not be requeued and must remove themselves from the hash
* bucket. The hash bucket lock (i.e. lock_ptr) is held.
*/
void futex_unqueue_pi(struct futex_q *q)
{
__futex_unqueue(q);
/*
* If the lock was not acquired (due to timeout or signal) then the
* rt_waiter is removed before futex_q is. If this is observed by
* an unlocker after dropping the rtmutex wait lock and before
* acquiring the hash bucket lock, then the unlocker dequeues the
* futex_q from the hash bucket list to guarantee consistent state
* vs. userspace. Therefore the dequeue here must be conditional.
*/
if (!plist_node_empty(&q->list))
__futex_unqueue(q);
BUG_ON(!q->pi_state);
put_pi_state(q->pi_state);

View File

@ -1135,6 +1135,7 @@ retry:
hb = futex_hash(&key);
spin_lock(&hb->lock);
retry_hb:
/*
* Check waiters first. We do not trust user space values at
@ -1177,12 +1178,17 @@ retry:
/*
* Futex vs rt_mutex waiter state -- if there are no rt_mutex
* waiters even though futex thinks there are, then the waiter
* is leaving and the uncontended path is safe to take.
* is leaving. The entry needs to be removed from the list so a
* new futex_lock_pi() is not using this stale PI-state while
* the futex is available in user space again.
* There can be more than one task on its way out so it needs
* to retry.
*/
rt_waiter = rt_mutex_top_waiter(&pi_state->pi_mutex);
if (!rt_waiter) {
__futex_unqueue(top_waiter);
raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
goto do_uncontended;
goto retry_hb;
}
get_pi_state(pi_state);
@ -1217,7 +1223,6 @@ retry:
return ret;
}
do_uncontended:
/*
* We have no kernel internal state, i.e. no waiters in the
* kernel. Waiters which are about to queue themselves are stuck