Locking updates:

  - Move futex code into kernel/futex/ and split up the kitchen sink into
    separate files to make integration of sys_futex_waitv() simpler.
 
  - Add a new sys_futex_waitv() syscall which allows waiting on multiple
    futexes. The main use case is emulating Windows' WaitForMultipleObjects,
    which allows Wine to improve the performance of Windows games. Native
    Linux games can also benefit from this interface, as this is a common
    wait pattern for this kind of application.
 
  - Add context to ww_mutex_trylock() to provide a path for i915 to rework
    their eviction code step by step without making lockdep upset until the
    final steps of rework are completed. It's also useful for regulator and
    TTM to avoid dropping locks in the uncontended path.
 
  - Lockdep and might_sleep() cleanups and improvements
 
  - A few improvements for the RT substitutions.
 
  - The usual small improvements and cleanups.

Merge tag 'locking-core-2021-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking updates from Thomas Gleixner:

 - Move futex code into kernel/futex/ and split up the kitchen sink into
   separate files to make integration of sys_futex_waitv() simpler.

 - Add a new sys_futex_waitv() syscall which allows waiting on multiple
   futexes.

   The main use case is emulating Windows' WaitForMultipleObjects, which
   allows Wine to improve the performance of Windows games. Native Linux
   games can also benefit from this interface, as this is a common wait
   pattern for this kind of application.

 - Add context to ww_mutex_trylock() to provide a path for i915 to
   rework their eviction code step by step without making lockdep upset
   until the final steps of rework are completed. It's also useful for
   regulator and TTM to avoid dropping locks in the uncontended path.

 - Lockdep and might_sleep() cleanups and improvements

 - A few improvements for the RT substitutions.

 - The usual small improvements and cleanups.

* tag 'locking-core-2021-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits)
  locking: Remove spin_lock_flags() etc
  locking/rwsem: Fix comments about reader optimistic lock stealing conditions
  locking: Remove rcu_read_{,un}lock() for preempt_{dis,en}able()
  locking/rwsem: Disable preemption for spinning region
  docs: futex: Fix kernel-doc references
  futex: Fix PREEMPT_RT build
  futex2: Documentation: Document sys_futex_waitv() uAPI
  selftests: futex: Test sys_futex_waitv() wouldblock
  selftests: futex: Test sys_futex_waitv() timeout
  selftests: futex: Add sys_futex_waitv() test
  futex,arm: Wire up sys_futex_waitv()
  futex,x86: Wire up sys_futex_waitv()
  futex: Implement sys_futex_waitv()
  futex: Simplify double_lock_hb()
  futex: Split out wait/wake
  futex: Split out requeue
  futex: Rename mark_wake_futex()
  futex: Rename: match_futex()
  futex: Rename: hb_waiter_{inc,dec,pending}()
  futex: Split out PI futex
  ...
Linus Torvalds 2021-11-01 13:15:36 -07:00
commit 595b28fb0c
63 changed files with 5520 additions and 4552 deletions

View file

@ -1352,7 +1352,19 @@ Mutex API reference
Futex API reference
===================
.. kernel-doc:: kernel/futex.c
.. kernel-doc:: kernel/futex/core.c
:internal:
.. kernel-doc:: kernel/futex/futex.h
:internal:
.. kernel-doc:: kernel/futex/pi.c
:internal:
.. kernel-doc:: kernel/futex/requeue.c
:internal:
.. kernel-doc:: kernel/futex/waitwake.c
:internal:
Further reading

View file

@ -1396,7 +1396,19 @@ Riferimento per l'API dei Mutex
Riferimento per l'API dei Futex
===============================
.. kernel-doc:: kernel/futex.c
.. kernel-doc:: kernel/futex/core.c
:internal:
.. kernel-doc:: kernel/futex/futex.h
:internal:
.. kernel-doc:: kernel/futex/pi.c
:internal:
.. kernel-doc:: kernel/futex/requeue.c
:internal:
.. kernel-doc:: kernel/futex/waitwake.c
:internal:
Approfondimenti

View file

@ -0,0 +1,86 @@
.. SPDX-License-Identifier: GPL-2.0
======
futex2
======
:Author: André Almeida <andrealmeid@collabora.com>
futex, or fast user mutex, is a set of syscalls that allows userspace to create
performant synchronization mechanisms, such as mutexes, semaphores and
condition variables. C standard libraries, like glibc, use it
as a means to implement higher-level interfaces like pthreads.
futex2 is a followup version of the initial futex syscall, designed to overcome
limitations of the original interface.
User API
========
``futex_waitv()``
-----------------
Wait on an array of futexes, wake on any::
futex_waitv(struct futex_waitv *waiters, unsigned int nr_futexes,
unsigned int flags, struct timespec *timeout, clockid_t clockid)
struct futex_waitv {
__u64 val;
__u64 uaddr;
__u32 flags;
__u32 __reserved;
};
Userspace sets up an array of struct futex_waitv (up to a max of 128 entries),
using ``uaddr`` for the address to wait on, ``val`` for the expected value
and ``flags`` to specify the type (e.g. private) and size of futex.
``__reserved`` needs to be 0, but it can be used for future extension. A
pointer to the first item of the array is passed as ``waiters``. An invalid
address for ``waiters`` or for any ``uaddr`` returns ``-EFAULT``.
If userspace has 32-bit pointers, it should do an explicit cast to make sure
the upper bits are zeroed. A cast through ``uintptr_t`` does the trick and
works for both 32-bit and 64-bit pointers.
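
As a hedged illustration of the paragraph above, the following sketch fills one array entry from userspace. The names ``struct futex_waitv_sketch`` and ``waitv_fill()`` are hypothetical; the struct is only a local mirror of the layout shown earlier, so the example does not depend on installed kernel headers that already ship ``struct futex_waitv``.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of the struct futex_waitv layout (assumption: illustration only). */
struct futex_waitv_sketch {
	uint64_t val;
	uint64_t uaddr;
	uint32_t flags;
	uint32_t reserved;	/* stands in for __reserved, must stay 0 */
};

/* Hypothetical helper: prepare one waiter entry. */
static inline void waitv_fill(struct futex_waitv_sketch *w, const uint32_t *addr,
			      uint32_t expected, uint32_t flags)
{
	assert(w != NULL);
	memset(w, 0, sizeof(*w));	/* zeroes the reserved field as well */
	w->val = expected;
	/* Going through uintptr_t zero-extends 32-bit pointers to 64 bits. */
	w->uaddr = (uint64_t)(uintptr_t)addr;
	w->flags = flags;
}
```

Zeroing the whole entry first keeps the reserved field at 0 even if the layout grows in a later extension.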
``nr_futexes`` specifies the size of the array. Values outside the [1, 128]
interval make the syscall return ``-EINVAL``.
The ``flags`` argument of the syscall needs to be 0, but it can be used for
future extension.
For each entry in the ``waiters`` array, the current value at ``uaddr`` is compared
to ``val``. If it's different, the syscall undoes all the work done so far and
returns ``-EAGAIN``. If all tests and verifications succeed, the syscall waits until
one of the following happens:
- The timeout expires, returning ``-ETIMEDOUT``.
- A signal was sent to the sleeping task, returning ``-ERESTARTSYS``.
- One of the futexes in the list was woken, returning the index of a woken futex.
An example of how to use the interface can be found at ``tools/testing/selftests/futex/functional/futex_waitv.c``.
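
The value check described above can be modelled in userspace. The sketch below is not the kernel implementation and never sleeps. It only mirrors the documented outcomes of the setup phase: ``-EINVAL`` for a bad count and ``-EAGAIN`` on the first mismatch. ``waitv_precheck()`` and ``struct futex_waitv_sketch`` are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define WAITV_MAX_SKETCH 128	/* mirrors the 128-entry limit stated above */

/* Local mirror of the struct futex_waitv layout (assumption: illustration only). */
struct futex_waitv_sketch {
	uint64_t val;
	uint64_t uaddr;
	uint32_t flags;
	uint32_t reserved;
};

/*
 * Hypothetical model of the documented checks.
 * Returns -EINVAL for a count outside [1, 128], -EAGAIN on the first
 * futex word that differs from its expected value, and 0 otherwise.
 */
static inline int waitv_precheck(const struct futex_waitv_sketch *w, unsigned int nr)
{
	unsigned int i;

	if (nr < 1 || nr > WAITV_MAX_SKETCH)
		return -EINVAL;
	assert(w != NULL);

	for (i = 0; i < nr; i++) {
		const volatile uint32_t *word =
			(const volatile uint32_t *)(uintptr_t)w[i].uaddr;

		if (*word != (uint32_t)w[i].val)
			return -EAGAIN;
	}
	return 0;
}
```

A caller that sees ``-EAGAIN`` would typically re-read the futex words and decide whether waiting is still necessary.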
Timeout
-------
The ``struct timespec *timeout`` argument is optional and points to an
absolute timeout. The type of clock being used must be specified in the
``clockid`` argument. ``CLOCK_MONOTONIC`` and ``CLOCK_REALTIME`` are supported.
This syscall accepts only 64-bit timespec structs.
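
Because the timeout is absolute, a caller that thinks in relative delays has to add the delay to the current time on the same clock. The sketch below shows one way to do that. ``timespec_add_ns()`` and ``waitv_deadline()`` are hypothetical helpers, and the code assumes a platform where ``struct timespec`` already uses 64-bit fields.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

#define NSEC_PER_SEC_SKETCH 1000000000L

/* Hypothetical helper: add a non-negative nanosecond delta to a timespec. */
static inline struct timespec timespec_add_ns(struct timespec base, long long delta_ns)
{
	assert(delta_ns >= 0);
	base.tv_sec += (time_t)(delta_ns / NSEC_PER_SEC_SKETCH);
	base.tv_nsec += (long)(delta_ns % NSEC_PER_SEC_SKETCH);
	if (base.tv_nsec >= NSEC_PER_SEC_SKETCH) {
		/* Carry the overflow into the seconds field. */
		base.tv_nsec -= NSEC_PER_SEC_SKETCH;
		base.tv_sec++;
	}
	return base;
}

/*
 * Hypothetical helper: turn a relative delay into an absolute deadline
 * measured on the given clock. Returns 0 on success, -1 if
 * clock_gettime() fails.
 */
static inline int waitv_deadline(clockid_t clockid, long long delay_ns,
				 struct timespec *out)
{
	struct timespec now;

	if (clock_gettime(clockid, &now) != 0)
		return -1;
	*out = timespec_add_ns(now, delay_ns);
	return 0;
}
```

The same ``clockid`` used to compute the deadline should then be passed to the syscall, otherwise the deadline is interpreted on a different time base.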
Types of futex
--------------
A futex can be either private or shared. Private is used for processes that
share the same memory space, where the virtual address of the futex is the
same for all processes. This allows for optimizations in the kernel. To use
private futexes, it's necessary to specify ``FUTEX_PRIVATE_FLAG`` in the futex
flag. Processes that don't share the same memory space, and can therefore
have different virtual addresses for the same futex (using, for instance,
file-backed shared memory), require different internal mechanisms to be
properly enqueued. This is the default behavior, and it works with both private
and shared futexes.
Futexes can be of different sizes: 8, 16, 32 or 64 bits. Currently, only
32-bit futexes are supported, and the size needs to be specified using the
``FUTEX_32`` flag.
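
To tie the flag description to the call itself, here is a minimal sketch that composes the per-waiter flags and invokes the syscall through the libc ``syscall()`` wrapper. Several things in it are assumptions: the ``SKETCH_*`` constants repeat the uAPI values (``FUTEX_32`` is 2, ``FUTEX_PRIVATE_FLAG`` is 128) so the block builds against older headers, the fallback number 449 is the one wired up for x86, arm and the generic table in this merge and may differ on other architectures, and ``waitv_flags()``, ``waitv_call()`` and ``struct futex_waitv_sketch`` are hypothetical names.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

/* uAPI values repeated locally so the sketch is standalone. */
#define SKETCH_FUTEX_PRIVATE_FLAG	128
#define SKETCH_FUTEX_32			2

/* Assumption: 449 as in the syscall tables touched by this merge. */
#ifndef SYS_futex_waitv
#define SYS_futex_waitv 449
#endif

/* Local mirror of the struct futex_waitv layout (assumption: illustration only). */
struct futex_waitv_sketch {
	uint64_t val;
	uint64_t uaddr;
	uint32_t flags;
	uint32_t reserved;
};

/* Hypothetical helper: the size flag is mandatory, private is optional. */
static inline uint32_t waitv_flags(int process_private)
{
	uint32_t flags = SKETCH_FUTEX_32;

	if (process_private)
		flags |= SKETCH_FUTEX_PRIVATE_FLAG;
	return flags;
}

/*
 * Hypothetical wrapper around the raw syscall. The syscall-level flags
 * argument is passed as 0, as the documentation requires.
 */
static inline long waitv_call(struct futex_waitv_sketch *w, unsigned int nr,
			      struct timespec *abs_timeout, clockid_t clockid)
{
	assert(nr >= 1 && nr <= 128);
	return syscall(SYS_futex_waitv, w, nr, 0u, abs_timeout, clockid);
}
```

Note that through the libc ``syscall()`` wrapper a failure shows up as a return value of -1 with ``errno`` set, rather than as the negative error code itself; on success the return value is the index of a woken futex.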

View file

@ -28,6 +28,7 @@ place where this information is gathered.
media/index
sysfs-platform_profile
vduse
futex2
.. only:: subproject and html

View file

@ -7737,6 +7737,7 @@ M: Ingo Molnar <mingo@redhat.com>
R: Peter Zijlstra <peterz@infradead.org>
R: Darren Hart <dvhart@infradead.org>
R: Davidlohr Bueso <dave@stgolabs.net>
R: André Almeida <andrealmeid@collabora.com>
L: linux-kernel@vger.kernel.org
S: Maintained
T: git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git locking/core
@ -7744,7 +7745,7 @@ F: Documentation/locking/*futex*
F: include/asm-generic/futex.h
F: include/linux/futex.h
F: include/uapi/linux/futex.h
F: kernel/futex.c
F: kernel/futex/*
F: tools/perf/bench/futex*
F: tools/testing/selftests/futex/

View file

@ -462,3 +462,4 @@
446 common landlock_restrict_self sys_landlock_restrict_self
# 447 reserved for memfd_secret
448 common process_mrelease sys_process_mrelease
449 common futex_waitv sys_futex_waitv

View file

@ -38,7 +38,7 @@
#define __ARM_NR_compat_set_tls (__ARM_NR_COMPAT_BASE + 5)
#define __ARM_NR_COMPAT_END (__ARM_NR_COMPAT_BASE + 0x800)
#define __NR_compat_syscalls 449
#define __NR_compat_syscalls 450
#endif
#define __ARCH_WANT_SYS_CLONE

View file

@ -903,6 +903,8 @@ __SYSCALL(__NR_landlock_add_rule, sys_landlock_add_rule)
__SYSCALL(__NR_landlock_restrict_self, sys_landlock_restrict_self)
#define __NR_process_mrelease 448
__SYSCALL(__NR_process_mrelease, sys_process_mrelease)
#define __NR_futex_waitv 449
__SYSCALL(__NR_futex_waitv, sys_futex_waitv)
/*
* Please add new compat syscalls above this comment and update

View file

@ -124,18 +124,13 @@ static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
__ticket_spin_unlock(lock);
}
static __always_inline void arch_spin_lock_flags(arch_spinlock_t *lock,
unsigned long flags)
{
arch_spin_lock(lock);
}
#define arch_spin_lock_flags arch_spin_lock_flags
#ifdef ASM_SUPPORTED
static __always_inline void
arch_read_lock_flags(arch_rwlock_t *lock, unsigned long flags)
arch_read_lock(arch_rwlock_t *lock)
{
unsigned long flags = 0;
__asm__ __volatile__ (
"tbit.nz p6, p0 = %1,%2\n"
"br.few 3f\n"
@ -157,13 +152,8 @@ arch_read_lock_flags(arch_rwlock_t *lock, unsigned long flags)
: "p6", "p7", "r2", "memory");
}
#define arch_read_lock_flags arch_read_lock_flags
#define arch_read_lock(lock) arch_read_lock_flags(lock, 0)
#else /* !ASM_SUPPORTED */
#define arch_read_lock_flags(rw, flags) arch_read_lock(rw)
#define arch_read_lock(rw) \
do { \
arch_rwlock_t *__read_lock_ptr = (rw); \
@ -186,8 +176,10 @@ do { \
#ifdef ASM_SUPPORTED
static __always_inline void
arch_write_lock_flags(arch_rwlock_t *lock, unsigned long flags)
arch_write_lock(arch_rwlock_t *lock)
{
unsigned long flags = 0;
__asm__ __volatile__ (
"tbit.nz p6, p0 = %1, %2\n"
"mov ar.ccv = r0\n"
@ -210,9 +202,6 @@ arch_write_lock_flags(arch_rwlock_t *lock, unsigned long flags)
: "ar.ccv", "p6", "p7", "r2", "r29", "memory");
}
#define arch_write_lock_flags arch_write_lock_flags
#define arch_write_lock(rw) arch_write_lock_flags(rw, 0)
#define arch_write_trylock(rw) \
({ \
register long result; \

View file

@ -19,9 +19,6 @@
#include <asm/qrwlock.h>
#define arch_read_lock_flags(lock, flags) arch_read_lock(lock)
#define arch_write_lock_flags(lock, flags) arch_write_lock(lock)
#define arch_spin_relax(lock) cpu_relax()
#define arch_read_relax(lock) cpu_relax()
#define arch_write_relax(lock) cpu_relax()

View file

@ -23,21 +23,6 @@ static inline void arch_spin_lock(arch_spinlock_t *x)
continue;
}
static inline void arch_spin_lock_flags(arch_spinlock_t *x,
unsigned long flags)
{
volatile unsigned int *a;
a = __ldcw_align(x);
while (__ldcw(a) == 0)
while (*a == 0)
if (flags & PSW_SM_I) {
local_irq_enable();
local_irq_disable();
}
}
#define arch_spin_lock_flags arch_spin_lock_flags
static inline void arch_spin_unlock(arch_spinlock_t *x)
{
volatile unsigned int *a;

View file

@ -123,27 +123,6 @@ static inline void arch_spin_lock(arch_spinlock_t *lock)
}
}
static inline
void arch_spin_lock_flags(arch_spinlock_t *lock, unsigned long flags)
{
unsigned long flags_dis;
while (1) {
if (likely(__arch_spin_trylock(lock) == 0))
break;
local_save_flags(flags_dis);
local_irq_restore(flags);
do {
HMT_low();
if (is_shared_processor())
splpar_spin_yield(lock);
} while (unlikely(lock->slock != 0));
HMT_medium();
local_irq_restore(flags_dis);
}
}
#define arch_spin_lock_flags arch_spin_lock_flags
static inline void arch_spin_unlock(arch_spinlock_t *lock)
{
__asm__ __volatile__("# arch_spin_unlock\n\t"

View file

@ -67,14 +67,6 @@ static inline void arch_spin_lock(arch_spinlock_t *lp)
arch_spin_lock_wait(lp);
}
static inline void arch_spin_lock_flags(arch_spinlock_t *lp,
unsigned long flags)
{
if (!arch_spin_trylock_once(lp))
arch_spin_lock_wait(lp);
}
#define arch_spin_lock_flags arch_spin_lock_flags
static inline int arch_spin_trylock(arch_spinlock_t *lp)
{
if (!arch_spin_trylock_once(lp))

View file

@ -453,3 +453,4 @@
446 i386 landlock_restrict_self sys_landlock_restrict_self
447 i386 memfd_secret sys_memfd_secret
448 i386 process_mrelease sys_process_mrelease
449 i386 futex_waitv sys_futex_waitv

View file

@ -370,6 +370,7 @@
446 common landlock_restrict_self sys_landlock_restrict_self
447 common memfd_secret sys_memfd_secret
448 common process_mrelease sys_process_mrelease
449 common futex_waitv sys_futex_waitv
#
# Due to a historical design error, certain syscalls are numbered differently

View file

@ -248,7 +248,7 @@ static inline int modeset_lock(struct drm_modeset_lock *lock,
if (ctx->trylock_only) {
lockdep_assert_held(&ctx->ww_ctx);
if (!ww_mutex_trylock(&lock->mutex))
if (!ww_mutex_trylock(&lock->mutex, NULL))
return -EBUSY;
else
return 0;

View file

@ -145,7 +145,7 @@ static inline int regulator_lock_nested(struct regulator_dev *rdev,
mutex_lock(&regulator_nesting_mutex);
if (ww_ctx || !ww_mutex_trylock(&rdev->mutex)) {
if (!ww_mutex_trylock(&rdev->mutex, ww_ctx)) {
if (rdev->mutex_owner == current)
rdev->ref_cnt++;
else

View file

@ -47,8 +47,6 @@ extern int debug_locks_off(void);
# define locking_selftest() do { } while (0)
#endif
struct task_struct;
#ifdef CONFIG_LOCKDEP
extern void debug_show_all_locks(void);
extern void debug_show_held_locks(struct task_struct *task);

View file

@ -173,7 +173,7 @@ static inline int dma_resv_lock_slow_interruptible(struct dma_resv *obj,
*/
static inline bool __must_check dma_resv_trylock(struct dma_resv *obj)
{
return ww_mutex_trylock(&obj->lock);
return ww_mutex_trylock(&obj->lock, NULL);
}
/**

View file

@ -111,8 +111,8 @@ static __always_inline void might_resched(void)
#endif /* CONFIG_PREEMPT_* */
#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
extern void ___might_sleep(const char *file, int line, int preempt_offset);
extern void __might_sleep(const char *file, int line, int preempt_offset);
extern void __might_resched(const char *file, int line, unsigned int offsets);
extern void __might_sleep(const char *file, int line);
extern void __cant_sleep(const char *file, int line, int preempt_offset);
extern void __cant_migrate(const char *file, int line);
@ -129,7 +129,7 @@ extern void __cant_migrate(const char *file, int line);
* supposed to.
*/
# define might_sleep() \
do { __might_sleep(__FILE__, __LINE__, 0); might_resched(); } while (0)
do { __might_sleep(__FILE__, __LINE__); might_resched(); } while (0)
/**
* cant_sleep - annotation for functions that cannot sleep
*
@ -168,10 +168,9 @@ extern void __cant_migrate(const char *file, int line);
*/
# define non_block_end() WARN_ON(current->non_block_count-- == 0)
#else
static inline void ___might_sleep(const char *file, int line,
int preempt_offset) { }
static inline void __might_sleep(const char *file, int line,
int preempt_offset) { }
static inline void __might_resched(const char *file, int line,
unsigned int offsets) { }
static inline void __might_sleep(const char *file, int line) { }
# define might_sleep() do { might_resched(); } while (0)
# define cant_sleep() do { } while (0)
# define cant_migrate() do { } while (0)

View file

@ -481,23 +481,6 @@ do { \
#endif /* CONFIG_LOCK_STAT */
#ifdef CONFIG_LOCKDEP
/*
* On lockdep we dont want the hand-coded irq-enable of
* _raw_*_lock_flags() code, because lockdep assumes
* that interrupts are not re-enabled during lock-acquire:
*/
#define LOCK_CONTENDED_FLAGS(_lock, try, lock, lockfl, flags) \
LOCK_CONTENDED((_lock), (try), (lock))
#else /* CONFIG_LOCKDEP */
#define LOCK_CONTENDED_FLAGS(_lock, try, lock, lockfl, flags) \
lockfl((_lock), (flags))
#endif /* CONFIG_LOCKDEP */
#ifdef CONFIG_PROVE_LOCKING
extern void print_irqtrace_events(struct task_struct *curr);
#else

View file

@ -21,7 +21,7 @@ enum lockdep_wait_type {
LD_WAIT_SPIN, /* spin loops, raw_spinlock_t etc.. */
#ifdef CONFIG_PROVE_RAW_LOCK_NESTING
LD_WAIT_CONFIG, /* CONFIG_PREEMPT_LOCK, spinlock_t etc.. */
LD_WAIT_CONFIG, /* preemptible in PREEMPT_RT, spinlock_t etc.. */
#else
LD_WAIT_CONFIG = LD_WAIT_SPIN,
#endif

View file

@ -122,9 +122,10 @@
* The preempt_count offset after spin_lock()
*/
#if !defined(CONFIG_PREEMPT_RT)
#define PREEMPT_LOCK_OFFSET PREEMPT_DISABLE_OFFSET
#define PREEMPT_LOCK_OFFSET PREEMPT_DISABLE_OFFSET
#else
#define PREEMPT_LOCK_OFFSET 0
/* Locks on RT do not disable preemption */
#define PREEMPT_LOCK_OFFSET 0
#endif
/*

View file

@ -30,31 +30,16 @@ do { \
#ifdef CONFIG_DEBUG_SPINLOCK
extern void do_raw_read_lock(rwlock_t *lock) __acquires(lock);
#define do_raw_read_lock_flags(lock, flags) do_raw_read_lock(lock)
extern int do_raw_read_trylock(rwlock_t *lock);
extern void do_raw_read_unlock(rwlock_t *lock) __releases(lock);
extern void do_raw_write_lock(rwlock_t *lock) __acquires(lock);
#define do_raw_write_lock_flags(lock, flags) do_raw_write_lock(lock)
extern int do_raw_write_trylock(rwlock_t *lock);
extern void do_raw_write_unlock(rwlock_t *lock) __releases(lock);
#else
#ifndef arch_read_lock_flags
# define arch_read_lock_flags(lock, flags) arch_read_lock(lock)
#endif
#ifndef arch_write_lock_flags
# define arch_write_lock_flags(lock, flags) arch_write_lock(lock)
#endif
# define do_raw_read_lock(rwlock) do {__acquire(lock); arch_read_lock(&(rwlock)->raw_lock); } while (0)
# define do_raw_read_lock_flags(lock, flags) \
do {__acquire(lock); arch_read_lock_flags(&(lock)->raw_lock, *(flags)); } while (0)
# define do_raw_read_trylock(rwlock) arch_read_trylock(&(rwlock)->raw_lock)
# define do_raw_read_unlock(rwlock) do {arch_read_unlock(&(rwlock)->raw_lock); __release(lock); } while (0)
# define do_raw_write_lock(rwlock) do {__acquire(lock); arch_write_lock(&(rwlock)->raw_lock); } while (0)
# define do_raw_write_lock_flags(lock, flags) \
do {__acquire(lock); arch_write_lock_flags(&(lock)->raw_lock, *(flags)); } while (0)
# define do_raw_write_trylock(rwlock) arch_write_trylock(&(rwlock)->raw_lock)
# define do_raw_write_unlock(rwlock) do {arch_write_unlock(&(rwlock)->raw_lock); __release(lock); } while (0)
#endif

View file

@ -157,8 +157,7 @@ static inline unsigned long __raw_read_lock_irqsave(rwlock_t *lock)
local_irq_save(flags);
preempt_disable();
rwlock_acquire_read(&lock->dep_map, 0, 0, _RET_IP_);
LOCK_CONTENDED_FLAGS(lock, do_raw_read_trylock, do_raw_read_lock,
do_raw_read_lock_flags, &flags);
LOCK_CONTENDED(lock, do_raw_read_trylock, do_raw_read_lock);
return flags;
}
@ -184,8 +183,7 @@ static inline unsigned long __raw_write_lock_irqsave(rwlock_t *lock)
local_irq_save(flags);
preempt_disable();
rwlock_acquire(&lock->dep_map, 0, 0, _RET_IP_);
LOCK_CONTENDED_FLAGS(lock, do_raw_write_trylock, do_raw_write_lock,
do_raw_write_lock_flags, &flags);
LOCK_CONTENDED(lock, do_raw_write_trylock, do_raw_write_lock);
return flags;
}

View file

@ -2037,7 +2037,7 @@ static inline int _cond_resched(void) { return 0; }
#endif /* !defined(CONFIG_PREEMPTION) || defined(CONFIG_PREEMPT_DYNAMIC) */
#define cond_resched() ({ \
___might_sleep(__FILE__, __LINE__, 0); \
__might_resched(__FILE__, __LINE__, 0); \
_cond_resched(); \
})
@ -2045,19 +2045,38 @@ extern int __cond_resched_lock(spinlock_t *lock);
extern int __cond_resched_rwlock_read(rwlock_t *lock);
extern int __cond_resched_rwlock_write(rwlock_t *lock);
#define cond_resched_lock(lock) ({ \
___might_sleep(__FILE__, __LINE__, PREEMPT_LOCK_OFFSET);\
__cond_resched_lock(lock); \
#define MIGHT_RESCHED_RCU_SHIFT 8
#define MIGHT_RESCHED_PREEMPT_MASK ((1U << MIGHT_RESCHED_RCU_SHIFT) - 1)
#ifndef CONFIG_PREEMPT_RT
/*
* Non RT kernels have an elevated preempt count due to the held lock,
* but are not allowed to be inside a RCU read side critical section
*/
# define PREEMPT_LOCK_RESCHED_OFFSETS PREEMPT_LOCK_OFFSET
#else
/*
* spin/rw_lock() on RT implies rcu_read_lock(). The might_sleep() check in
* cond_resched*lock() has to take that into account because it checks for
* preempt_count() and rcu_preempt_depth().
*/
# define PREEMPT_LOCK_RESCHED_OFFSETS \
(PREEMPT_LOCK_OFFSET + (1U << MIGHT_RESCHED_RCU_SHIFT))
#endif
#define cond_resched_lock(lock) ({ \
__might_resched(__FILE__, __LINE__, PREEMPT_LOCK_RESCHED_OFFSETS); \
__cond_resched_lock(lock); \
})
#define cond_resched_rwlock_read(lock) ({ \
__might_sleep(__FILE__, __LINE__, PREEMPT_LOCK_OFFSET); \
__cond_resched_rwlock_read(lock); \
#define cond_resched_rwlock_read(lock) ({ \
__might_resched(__FILE__, __LINE__, PREEMPT_LOCK_RESCHED_OFFSETS); \
__cond_resched_rwlock_read(lock); \
})
#define cond_resched_rwlock_write(lock) ({ \
__might_sleep(__FILE__, __LINE__, PREEMPT_LOCK_OFFSET); \
__cond_resched_rwlock_write(lock); \
#define cond_resched_rwlock_write(lock) ({ \
__might_resched(__FILE__, __LINE__, PREEMPT_LOCK_RESCHED_OFFSETS); \
__cond_resched_rwlock_write(lock); \
})
static inline void cond_resched_rcu(void)
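
The hunk above packs two expected nesting depths into the single ``offsets`` argument of ``__might_resched()``: the preempt count offset in the low 8 bits and the RCU read-side depth above ``MIGHT_RESCHED_RCU_SHIFT``. The following standalone sketch shows that encoding. The two macros are copied from the hunk, while the ``resched_offsets_*()`` helpers are hypothetical and only assume that a consumer splits the word with the same mask and shift; the checking code itself is not part of this excerpt.

```c
#include <assert.h>

#define MIGHT_RESCHED_RCU_SHIFT		8
#define MIGHT_RESCHED_PREEMPT_MASK	((1U << MIGHT_RESCHED_RCU_SHIFT) - 1)

/* Hypothetical helper: combine both expected depths into one word. */
static inline unsigned int resched_offsets_pack(unsigned int preempt_offset,
						unsigned int rcu_depth)
{
	assert(preempt_offset <= MIGHT_RESCHED_PREEMPT_MASK);
	return preempt_offset + (rcu_depth << MIGHT_RESCHED_RCU_SHIFT);
}

/* Hypothetical helper: expected preempt count offset (low 8 bits). */
static inline unsigned int resched_offsets_preempt(unsigned int offsets)
{
	return offsets & MIGHT_RESCHED_PREEMPT_MASK;
}

/* Hypothetical helper: expected RCU read-side nesting depth. */
static inline unsigned int resched_offsets_rcu(unsigned int offsets)
{
	return offsets >> MIGHT_RESCHED_RCU_SHIFT;
}
```

With this layout, a non-RT caller holding one spinlock would pass a word whose preempt part is the lock offset and whose RCU part is 0, while an RT caller would pass a preempt part of 0 and an RCU part of 1.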

View file

@ -177,7 +177,6 @@ do { \
#ifdef CONFIG_DEBUG_SPINLOCK
extern void do_raw_spin_lock(raw_spinlock_t *lock) __acquires(lock);
#define do_raw_spin_lock_flags(lock, flags) do_raw_spin_lock(lock)
extern int do_raw_spin_trylock(raw_spinlock_t *lock);
extern void do_raw_spin_unlock(raw_spinlock_t *lock) __releases(lock);
#else
@ -188,18 +187,6 @@ static inline void do_raw_spin_lock(raw_spinlock_t *lock) __acquires(lock)
mmiowb_spin_lock();
}
#ifndef arch_spin_lock_flags
#define arch_spin_lock_flags(lock, flags) arch_spin_lock(lock)
#endif
static inline void
do_raw_spin_lock_flags(raw_spinlock_t *lock, unsigned long *flags) __acquires(lock)
{
__acquire(lock);
arch_spin_lock_flags(&lock->raw_lock, *flags);
mmiowb_spin_lock();
}
static inline int do_raw_spin_trylock(raw_spinlock_t *lock)
{
int ret = arch_spin_trylock(&(lock)->raw_lock);

View file

@ -108,16 +108,7 @@ static inline unsigned long __raw_spin_lock_irqsave(raw_spinlock_t *lock)
local_irq_save(flags);
preempt_disable();
spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
/*
* On lockdep we dont want the hand-coded irq-enable of
* do_raw_spin_lock_flags() code, because lockdep assumes
* that interrupts are not re-enabled during lock-acquire:
*/
#ifdef CONFIG_LOCKDEP
LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
#else
do_raw_spin_lock_flags(lock, &flags);
#endif
return flags;
}

View file

@ -62,7 +62,6 @@ static inline void arch_spin_unlock(arch_spinlock_t *lock)
#define arch_spin_is_locked(lock) ((void)(lock), 0)
/* for sched/core.c and kernel_lock.c: */
# define arch_spin_lock(lock) do { barrier(); (void)(lock); } while (0)
# define arch_spin_lock_flags(lock, flags) do { barrier(); (void)(lock); } while (0)
# define arch_spin_unlock(lock) do { barrier(); (void)(lock); } while (0)
# define arch_spin_trylock(lock) ({ barrier(); (void)(lock); 1; })
#endif /* DEBUG_SPINLOCK */

View file

@ -58,6 +58,7 @@ struct mq_attr;
struct compat_stat;
struct old_timeval32;
struct robust_list_head;
struct futex_waitv;
struct getcpu_cache;
struct old_linux_dirent;
struct perf_event_attr;
@ -610,7 +611,7 @@ asmlinkage long sys_waitid(int which, pid_t pid,
asmlinkage long sys_set_tid_address(int __user *tidptr);
asmlinkage long sys_unshare(unsigned long unshare_flags);
/* kernel/futex.c */
/* kernel/futex/syscalls.c */
asmlinkage long sys_futex(u32 __user *uaddr, int op, u32 val,
const struct __kernel_timespec __user *utime,
u32 __user *uaddr2, u32 val3);
@ -623,6 +624,10 @@ asmlinkage long sys_get_robust_list(int pid,
asmlinkage long sys_set_robust_list(struct robust_list_head __user *head,
size_t len);
asmlinkage long sys_futex_waitv(struct futex_waitv *waiters,
unsigned int nr_futexes, unsigned int flags,
struct __kernel_timespec __user *timeout, clockid_t clockid);
/* kernel/hrtimer.c */
asmlinkage long sys_nanosleep(struct __kernel_timespec __user *rqtp,
struct __kernel_timespec __user *rmtp);

View file

@ -28,12 +28,10 @@
#ifndef CONFIG_PREEMPT_RT
#define WW_MUTEX_BASE mutex
#define ww_mutex_base_init(l,n,k) __mutex_init(l,n,k)
#define ww_mutex_base_trylock(l) mutex_trylock(l)
#define ww_mutex_base_is_locked(b) mutex_is_locked((b))
#else
#define WW_MUTEX_BASE rt_mutex
#define ww_mutex_base_init(l,n,k) __rt_mutex_init(l,n,k)
#define ww_mutex_base_trylock(l) rt_mutex_trylock(l)
#define ww_mutex_base_is_locked(b) rt_mutex_base_is_locked(&(b)->rtmutex)
#endif
@ -339,17 +337,8 @@ ww_mutex_lock_slow_interruptible(struct ww_mutex *lock,
extern void ww_mutex_unlock(struct ww_mutex *lock);
/**
* ww_mutex_trylock - tries to acquire the w/w mutex without acquire context
* @lock: mutex to lock
*
* Trylocks a mutex without acquire context, so no deadlock detection is
* possible. Returns 1 if the mutex has been acquired successfully, 0 otherwise.
*/
static inline int __must_check ww_mutex_trylock(struct ww_mutex *lock)
{
return ww_mutex_base_trylock(&lock->base);
}
extern int __must_check ww_mutex_trylock(struct ww_mutex *lock,
struct ww_acquire_ctx *ctx);
/***
* ww_mutex_destroy - mark a w/w mutex unusable

View file

@ -880,8 +880,11 @@ __SYSCALL(__NR_memfd_secret, sys_memfd_secret)
#define __NR_process_mrelease 448
__SYSCALL(__NR_process_mrelease, sys_process_mrelease)
#define __NR_futex_waitv 449
__SYSCALL(__NR_futex_waitv, sys_futex_waitv)
#undef __NR_syscalls
#define __NR_syscalls 449
#define __NR_syscalls 450
/*
* 32 bit systems traditionally used different

View file

@ -43,6 +43,31 @@
#define FUTEX_CMP_REQUEUE_PI_PRIVATE (FUTEX_CMP_REQUEUE_PI | \
FUTEX_PRIVATE_FLAG)
/*
* Flags to specify the bit length of the futex word for futex2 syscalls.
* Currently, only 32 is supported.
*/
#define FUTEX_32 2
/*
* Max numbers of elements in a futex_waitv array
*/
#define FUTEX_WAITV_MAX 128
/**
* struct futex_waitv - A waiter for vectorized wait
* @val: Expected value at uaddr
* @uaddr: User address to wait on
* @flags: Flags for this waiter
* @__reserved: Reserved member to preserve data alignment. Should be 0.
*/
struct futex_waitv {
__u64 val;
__u64 uaddr;
__u32 flags;
__u32 __reserved;
};
/*
* Support for robust futexes: the kernel cleans up held futexes at
* thread exit time.

View file

@ -59,7 +59,7 @@ obj-$(CONFIG_FREEZER) += freezer.o
obj-$(CONFIG_PROFILING) += profile.o
obj-$(CONFIG_STACKTRACE) += stacktrace.o
obj-y += time/
obj-$(CONFIG_FUTEX) += futex.o
obj-$(CONFIG_FUTEX) += futex/
obj-$(CONFIG_GENERIC_ISA_DMA) += dma.o
obj-$(CONFIG_SMP) += smp.o
ifneq ($(CONFIG_SMP),y)

File diff suppressed because it is too large

3
kernel/futex/Makefile Normal file
View file

@ -0,0 +1,3 @@
# SPDX-License-Identifier: GPL-2.0
obj-y += core.o syscalls.o pi.o requeue.o waitwake.o

1176
kernel/futex/core.c Normal file

File diff suppressed because it is too large

299
kernel/futex/futex.h Normal file
View file

@ -0,0 +1,299 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _FUTEX_H
#define _FUTEX_H
#include <linux/futex.h>
#include <linux/sched/wake_q.h>
#ifdef CONFIG_PREEMPT_RT
#include <linux/rcuwait.h>
#endif
#include <asm/futex.h>
/*
* Futex flags used to encode options to functions and preserve them across
* restarts.
*/
#ifdef CONFIG_MMU
# define FLAGS_SHARED 0x01
#else
/*
* NOMMU does not have per process address space. Let the compiler optimize
* code away.
*/
# define FLAGS_SHARED 0x00
#endif
#define FLAGS_CLOCKRT 0x02
#define FLAGS_HAS_TIMEOUT 0x04
#ifdef CONFIG_HAVE_FUTEX_CMPXCHG
#define futex_cmpxchg_enabled 1
#else
extern int __read_mostly futex_cmpxchg_enabled;
#endif
#ifdef CONFIG_FAIL_FUTEX
extern bool should_fail_futex(bool fshared);
#else
static inline bool should_fail_futex(bool fshared)
{
return false;
}
#endif
/*
* Hash buckets are shared by all the futex_keys that hash to the same
* location. Each key may have multiple futex_q structures, one for each task
* waiting on a futex.
*/
struct futex_hash_bucket {
atomic_t waiters;
spinlock_t lock;
struct plist_head chain;
} ____cacheline_aligned_in_smp;
/*
* Priority Inheritance state:
*/
struct futex_pi_state {
/*
* list of 'owned' pi_state instances - these have to be
* cleaned up in do_exit() if the task exits prematurely:
*/
struct list_head list;
/*
* The PI object:
*/
struct rt_mutex_base pi_mutex;
struct task_struct *owner;
refcount_t refcount;
union futex_key key;
} __randomize_layout;
/**
* struct futex_q - The hashed futex queue entry, one per waiting task
* @list: priority-sorted list of tasks waiting on this futex
* @task: the task waiting on the futex
* @lock_ptr: the hash bucket lock
* @key: the key the futex is hashed on
* @pi_state: optional priority inheritance state
* @rt_waiter: rt_waiter storage for use with requeue_pi
* @requeue_pi_key: the requeue_pi target futex key
* @bitset: bitset for the optional bitmasked wakeup
* @requeue_state: State field for futex_requeue_pi()
* @requeue_wait: RCU wait for futex_requeue_pi() (RT only)
*
* We use this hashed waitqueue, instead of a normal wait_queue_entry_t, so
* we can wake only the relevant ones (hashed queues may be shared).
*
* A futex_q has a woken state, just like tasks have TASK_RUNNING.
* It is considered woken when plist_node_empty(&q->list) || q->lock_ptr == 0.
* The order of wakeup is always to make the first condition true, then
* the second.
*
* PI futexes are typically woken before they are removed from the hash list via
* the rt_mutex code. See futex_unqueue_pi().
*/
struct futex_q {
struct plist_node list;
struct task_struct *task;
spinlock_t *lock_ptr;
union futex_key key;
struct futex_pi_state *pi_state;
struct rt_mutex_waiter *rt_waiter;
union futex_key *requeue_pi_key;
u32 bitset;
atomic_t requeue_state;
#ifdef CONFIG_PREEMPT_RT
struct rcuwait requeue_wait;
#endif
} __randomize_layout;
extern const struct futex_q futex_q_init;
enum futex_access {
FUTEX_READ,
FUTEX_WRITE
};
extern int get_futex_key(u32 __user *uaddr, bool fshared, union futex_key *key,
enum futex_access rw);
extern struct hrtimer_sleeper *
futex_setup_timer(ktime_t *time, struct hrtimer_sleeper *timeout,
int flags, u64 range_ns);
extern struct futex_hash_bucket *futex_hash(union futex_key *key);
/**
* futex_match - Check whether two futex keys are equal
* @key1: Pointer to key1
* @key2: Pointer to key2
*
* Return 1 if two futex_keys are equal, 0 otherwise.
*/
static inline int futex_match(union futex_key *key1, union futex_key *key2)
{
return (key1 && key2
&& key1->both.word == key2->both.word
&& key1->both.ptr == key2->both.ptr
&& key1->both.offset == key2->both.offset);
}
extern int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
struct futex_q *q, struct futex_hash_bucket **hb);
extern void futex_wait_queue(struct futex_hash_bucket *hb, struct futex_q *q,
struct hrtimer_sleeper *timeout);
extern void futex_wake_mark(struct wake_q_head *wake_q, struct futex_q *q);
extern int fault_in_user_writeable(u32 __user *uaddr);
extern int futex_cmpxchg_value_locked(u32 *curval, u32 __user *uaddr, u32 uval, u32 newval);
extern int futex_get_value_locked(u32 *dest, u32 __user *from);
extern struct futex_q *futex_top_waiter(struct futex_hash_bucket *hb, union futex_key *key);
extern void __futex_unqueue(struct futex_q *q);
extern void __futex_queue(struct futex_q *q, struct futex_hash_bucket *hb);
extern int futex_unqueue(struct futex_q *q);
/**
* futex_queue() - Enqueue the futex_q on the futex_hash_bucket
* @q: The futex_q to enqueue
* @hb: The destination hash bucket
*
* The hb->lock must be held by the caller, and is released here. A call to
* futex_queue() is typically paired with exactly one call to futex_unqueue(). The
* exceptions involve the PI related operations, which may use futex_unqueue_pi()
* or nothing if the unqueue is done as part of the wake process and the unqueue
* state is implicit in the state of the woken task (see futex_wait_requeue_pi() for
* an example).
*/
static inline void futex_queue(struct futex_q *q, struct futex_hash_bucket *hb)
__releases(&hb->lock)
{
__futex_queue(q, hb);
spin_unlock(&hb->lock);
}
extern void futex_unqueue_pi(struct futex_q *q);
extern void wait_for_owner_exiting(int ret, struct task_struct *exiting);
/*
* Reflects a new waiter being added to the waitqueue.
*/
static inline void futex_hb_waiters_inc(struct futex_hash_bucket *hb)
{
#ifdef CONFIG_SMP
atomic_inc(&hb->waiters);
/*
* Full barrier (A), see the ordering comment above.
*/
smp_mb__after_atomic();
#endif
}
/*
* Reflects a waiter being removed from the waitqueue by wakeup
* paths.
*/
static inline void futex_hb_waiters_dec(struct futex_hash_bucket *hb)
{
#ifdef CONFIG_SMP
atomic_dec(&hb->waiters);
#endif
}
static inline int futex_hb_waiters_pending(struct futex_hash_bucket *hb)
{
#ifdef CONFIG_SMP
/*
* Full barrier (B), see the ordering comment above.
*/
smp_mb();
return atomic_read(&hb->waiters);
#else
return 1;
#endif
}
extern struct futex_hash_bucket *futex_q_lock(struct futex_q *q);
extern void futex_q_unlock(struct futex_hash_bucket *hb);
extern int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb,
union futex_key *key,
struct futex_pi_state **ps,
struct task_struct *task,
struct task_struct **exiting,
int set_waiters);
extern int refill_pi_state_cache(void);
extern void get_pi_state(struct futex_pi_state *pi_state);
extern void put_pi_state(struct futex_pi_state *pi_state);
extern int fixup_pi_owner(u32 __user *uaddr, struct futex_q *q, int locked);
/*
* Express the locking dependencies for lockdep:
*/
static inline void
double_lock_hb(struct futex_hash_bucket *hb1, struct futex_hash_bucket *hb2)
{
if (hb1 > hb2)
swap(hb1, hb2);
spin_lock(&hb1->lock);
if (hb1 != hb2)
spin_lock_nested(&hb2->lock, SINGLE_DEPTH_NESTING);
}
static inline void
double_unlock_hb(struct futex_hash_bucket *hb1, struct futex_hash_bucket *hb2)
{
spin_unlock(&hb1->lock);
if (hb1 != hb2)
spin_unlock(&hb2->lock);
}
/* syscalls */
extern int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags, u32
val, ktime_t *abs_time, u32 bitset, u32 __user
*uaddr2);
extern int futex_requeue(u32 __user *uaddr1, unsigned int flags,
u32 __user *uaddr2, int nr_wake, int nr_requeue,
u32 *cmpval, int requeue_pi);
extern int futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
ktime_t *abs_time, u32 bitset);
/**
* struct futex_vector - Auxiliary struct for futex_waitv()
* @w: Userspace provided data
* @q: Kernel side data
*
* Struct used to build an array with all data needed for futex_waitv()
*/
struct futex_vector {
struct futex_waitv w;
struct futex_q q;
};
extern int futex_wait_multiple(struct futex_vector *vs, unsigned int count,
struct hrtimer_sleeper *to);
extern int futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset);
extern int futex_wake_op(u32 __user *uaddr1, unsigned int flags,
u32 __user *uaddr2, int nr_wake, int nr_wake2, int op);
extern int futex_unlock_pi(u32 __user *uaddr, unsigned int flags);
extern int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int trylock);
#endif /* _FUTEX_H */
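
The address-ordered locking used by double_lock_hb() and double_unlock_hb() in the header above can be illustrated outside the kernel. The following is a minimal userspace sketch, not kernel code: `struct bucket`, `bucket_lock()`, `bucket_trylock()`, `bucket_unlock()`, `double_lock()` and `double_unlock()` are hypothetical names, and a C11 `atomic_flag` spinlock stands in for the kernel's spinlock_t. It only shows why sorting the two pointers before locking avoids ABBA deadlocks and why the equal-pointer case must lock once.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct futex_hash_bucket; only the lock matters. */
struct bucket {
	atomic_flag lock;	/* clear = unlocked, set = locked */
};

static void bucket_lock(struct bucket *b)
{
	/* Spin until the flag was previously clear. */
	while (atomic_flag_test_and_set_explicit(&b->lock, memory_order_acquire))
		;
}

static bool bucket_trylock(struct bucket *b)
{
	/* Returns true if the lock was free and is now held by the caller. */
	return !atomic_flag_test_and_set_explicit(&b->lock, memory_order_acquire);
}

static void bucket_unlock(struct bucket *b)
{
	atomic_flag_clear_explicit(&b->lock, memory_order_release);
}

/*
 * Always take the lower-addressed bucket first, so two callers passing the
 * same pair in opposite order still acquire the locks in the same order.
 * If both arguments name the same bucket, lock it only once.
 */
static void double_lock(struct bucket *b1, struct bucket *b2)
{
	if ((uintptr_t)b1 > (uintptr_t)b2) {
		struct bucket *tmp = b1;

		b1 = b2;
		b2 = tmp;
	}
	bucket_lock(b1);
	if (b1 != b2)
		bucket_lock(b2);
}

static void double_unlock(struct bucket *b1, struct bucket *b2)
{
	bucket_unlock(b1);
	if (b1 != b2)
		bucket_unlock(b2);
}
```

Unlock order does not matter for deadlock avoidance, which is why the unlock helper needs no sorting. The kernel version additionally uses spin_lock_nested() with SINGLE_DEPTH_NESTING so that lockdep accepts two locks of the same class being held at once; the sketch has no equivalent of that annotation.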

1233
kernel/futex/pi.c Normal file

File diff suppressed because it is too large

897
kernel/futex/requeue.c Normal file

@@ -0,0 +1,897 @@
// SPDX-License-Identifier: GPL-2.0-or-later
#include <linux/sched/signal.h>
#include "futex.h"
#include "../locking/rtmutex_common.h"
/*
* On PREEMPT_RT, the hash bucket lock is a 'sleeping' spinlock with an
* underlying rtmutex. The task which is about to be requeued could have
* just woken up (timeout, signal). After the wake up the task has to
* acquire hash bucket lock, which is held by the requeue code. As a task
* can only be blocked on _ONE_ rtmutex at a time, the proxy lock blocking
* and the hash bucket lock blocking would collide and corrupt state.
*
* On !PREEMPT_RT this is not a problem and everything could be serialized
* on the hash bucket lock. But aside from the benefit of common code, the
* state machine avoids doing the requeue when the task is already on the
* way out, and avoids taking the hash bucket lock of the original uaddr1
* when the requeue has been completed.
*
* The following state transitions are valid:
*
* On the waiter side:
* Q_REQUEUE_PI_NONE -> Q_REQUEUE_PI_IGNORE
* Q_REQUEUE_PI_IN_PROGRESS -> Q_REQUEUE_PI_WAIT
*
* On the requeue side:
* Q_REQUEUE_PI_NONE -> Q_REQUEUE_PI_IN_PROGRESS
* Q_REQUEUE_PI_IN_PROGRESS -> Q_REQUEUE_PI_DONE/LOCKED
* Q_REQUEUE_PI_IN_PROGRESS -> Q_REQUEUE_PI_NONE (requeue failed)
* Q_REQUEUE_PI_WAIT -> Q_REQUEUE_PI_DONE/LOCKED
* Q_REQUEUE_PI_WAIT -> Q_REQUEUE_PI_IGNORE (requeue failed)
*
* The requeue side ignores a waiter with state Q_REQUEUE_PI_IGNORE as this
* signals that the waiter is already on the way out. It also means that
* the waiter is still on the 'wait' futex, i.e. uaddr1.
*
* The waiter side signals early wakeup to the requeue side either through
* setting state to Q_REQUEUE_PI_IGNORE or to Q_REQUEUE_PI_WAIT depending
* on the current state. In case of Q_REQUEUE_PI_IGNORE it can immediately
* proceed to take the hash bucket lock of uaddr1. If it set the state to
* WAIT, which means the wakeup is interleaving with a requeue in progress,
* it has to wait for the requeue side to change the state, either to
* DONE/LOCKED or to IGNORE. DONE/LOCKED means the waiter q is now on the
* uaddr2 futex and is either blocked (DONE) or has acquired it (LOCKED).
* IGNORE is set by the requeue side when the requeue attempt failed via
* deadlock detection and therefore the waiter q is still on the uaddr1
* futex.
*/
enum {
Q_REQUEUE_PI_NONE = 0,
Q_REQUEUE_PI_IGNORE,
Q_REQUEUE_PI_IN_PROGRESS,
Q_REQUEUE_PI_WAIT,
Q_REQUEUE_PI_DONE,
Q_REQUEUE_PI_LOCKED,
};
const struct futex_q futex_q_init = {
/* list gets initialized in futex_queue() */
.key = FUTEX_KEY_INIT,
.bitset = FUTEX_BITSET_MATCH_ANY,
.requeue_state = ATOMIC_INIT(Q_REQUEUE_PI_NONE),
};
/**
* requeue_futex() - Requeue a futex_q from one hb to another
* @q: the futex_q to requeue
* @hb1: the source hash_bucket
* @hb2: the target hash_bucket
* @key2: the new key for the requeued futex_q
*/
static inline
void requeue_futex(struct futex_q *q, struct futex_hash_bucket *hb1,
struct futex_hash_bucket *hb2, union futex_key *key2)
{
/*
* If key1 and key2 hash to the same bucket, no need to
* requeue.
*/
if (likely(&hb1->chain != &hb2->chain)) {
plist_del(&q->list, &hb1->chain);
futex_hb_waiters_dec(hb1);
futex_hb_waiters_inc(hb2);
plist_add(&q->list, &hb2->chain);
q->lock_ptr = &hb2->lock;
}
q->key = *key2;
}
static inline bool futex_requeue_pi_prepare(struct futex_q *q,
struct futex_pi_state *pi_state)
{
int old, new;
/*
* Set state to Q_REQUEUE_PI_IN_PROGRESS unless an early wakeup has
* already set Q_REQUEUE_PI_IGNORE to signal that requeue should
* ignore the waiter.
*/
old = atomic_read_acquire(&q->requeue_state);
do {
if (old == Q_REQUEUE_PI_IGNORE)
return false;
/*
* futex_proxy_trylock_atomic() might have set it to
* IN_PROGRESS and an interleaved early wake to WAIT.
*
* It was considered to have an extra state for that
* trylock, but that would just add more conditionals
* all over the place for a dubious value.
*/
if (old != Q_REQUEUE_PI_NONE)
break;
new = Q_REQUEUE_PI_IN_PROGRESS;
} while (!atomic_try_cmpxchg(&q->requeue_state, &old, new));
q->pi_state = pi_state;
return true;
}
static inline void futex_requeue_pi_complete(struct futex_q *q, int locked)
{
int old, new;
old = atomic_read_acquire(&q->requeue_state);
do {
if (old == Q_REQUEUE_PI_IGNORE)
return;
if (locked >= 0) {
/* Requeue succeeded. Set DONE or LOCKED */
WARN_ON_ONCE(old != Q_REQUEUE_PI_IN_PROGRESS &&
old != Q_REQUEUE_PI_WAIT);
new = Q_REQUEUE_PI_DONE + locked;
} else if (old == Q_REQUEUE_PI_IN_PROGRESS) {
/* Deadlock, no early wakeup interleave */
new = Q_REQUEUE_PI_NONE;
} else {
/* Deadlock, early wakeup interleave. */
WARN_ON_ONCE(old != Q_REQUEUE_PI_WAIT);
new = Q_REQUEUE_PI_IGNORE;
}
} while (!atomic_try_cmpxchg(&q->requeue_state, &old, new));
#ifdef CONFIG_PREEMPT_RT
/* If the waiter interleaved with the requeue let it know */
if (unlikely(old == Q_REQUEUE_PI_WAIT))
rcuwait_wake_up(&q->requeue_wait);
#endif
}
static inline int futex_requeue_pi_wakeup_sync(struct futex_q *q)
{
int old, new;
old = atomic_read_acquire(&q->requeue_state);
do {
/* Is requeue done already? */
if (old >= Q_REQUEUE_PI_DONE)
return old;
/*
* If not done, then tell the requeue code to either ignore
* the waiter or to wake it up once the requeue is done.
*/
new = Q_REQUEUE_PI_WAIT;
if (old == Q_REQUEUE_PI_NONE)
new = Q_REQUEUE_PI_IGNORE;
} while (!atomic_try_cmpxchg(&q->requeue_state, &old, new));
/* If the requeue was in progress, wait for it to complete */
if (old == Q_REQUEUE_PI_IN_PROGRESS) {
#ifdef CONFIG_PREEMPT_RT
rcuwait_wait_event(&q->requeue_wait,
atomic_read(&q->requeue_state) != Q_REQUEUE_PI_WAIT,
TASK_UNINTERRUPTIBLE);
#else
(void)atomic_cond_read_relaxed(&q->requeue_state, VAL != Q_REQUEUE_PI_WAIT);
#endif
}
/*
* Requeue is now either prohibited or complete. Reread state
* because during the wait above it might have changed. Nothing
* will modify q->requeue_state after this point.
*/
return atomic_read(&q->requeue_state);
}
/**
* requeue_pi_wake_futex() - Wake a task that acquired the lock during requeue
* @q: the futex_q
* @key: the key of the requeue target futex
* @hb: the hash_bucket of the requeue target futex
*
* During futex_requeue, with requeue_pi=1, it is possible to acquire the
* target futex if it is uncontended or via a lock steal.
*
* 1) Set @q::key to the requeue target futex key so the waiter can detect
* the wakeup on the right futex.
*
* 2) Dequeue @q from the hash bucket.
*
* 3) Set @q::rt_waiter to NULL so the woken up task can detect atomic lock
* acquisition.
*
* 4) Set the q->lock_ptr to the requeue target hb->lock for the case that
* the waiter has to fixup the pi state.
*
* 5) Complete the requeue state so the waiter can make progress. After
* this point the waiter task can return from the syscall immediately in
* case that the pi state does not have to be fixed up.
*
* 6) Wake the waiter task.
*
* Must be called with both q->lock_ptr and hb->lock held.
*/
static inline
void requeue_pi_wake_futex(struct futex_q *q, union futex_key *key,
struct futex_hash_bucket *hb)
{
q->key = *key;
__futex_unqueue(q);
WARN_ON(!q->rt_waiter);
q->rt_waiter = NULL;
q->lock_ptr = &hb->lock;
/* Signal locked state to the waiter */
futex_requeue_pi_complete(q, 1);
wake_up_state(q->task, TASK_NORMAL);
}
/**
* futex_proxy_trylock_atomic() - Attempt an atomic lock for the top waiter
* @pifutex: the user address of the to futex
* @hb1: the from futex hash bucket, must be locked by the caller
* @hb2: the to futex hash bucket, must be locked by the caller
* @key1: the from futex key
* @key2: the to futex key
* @ps: address to store the pi_state pointer
* @exiting: Pointer to store the task pointer of the owner task
* which is in the middle of exiting
* @set_waiters: force setting the FUTEX_WAITERS bit (1) or not (0)
*
* Try and get the lock on behalf of the top waiter if we can do it atomically.
* Wake the top waiter if we succeed. If the caller specified set_waiters,
* then direct futex_lock_pi_atomic() to force setting the FUTEX_WAITERS bit.
* hb1 and hb2 must be held by the caller.
*
* @exiting is only set when the return value is -EBUSY. If so, this holds
* a refcount on the exiting task on return and the caller needs to drop it
* after waiting for the exit to complete.
*
* Return:
* - 0 - failed to acquire the lock atomically;
* - >0 - acquired the lock, return value is vpid of the top_waiter
* - <0 - error
*/
static int
futex_proxy_trylock_atomic(u32 __user *pifutex, struct futex_hash_bucket *hb1,
struct futex_hash_bucket *hb2, union futex_key *key1,
union futex_key *key2, struct futex_pi_state **ps,
struct task_struct **exiting, int set_waiters)
{
struct futex_q *top_waiter = NULL;
u32 curval;
int ret;
if (futex_get_value_locked(&curval, pifutex))
return -EFAULT;
if (unlikely(should_fail_futex(true)))
return -EFAULT;
/*
* Find the top_waiter and determine if there are additional waiters.
* If the caller intends to requeue more than 1 waiter to pifutex,
* force futex_lock_pi_atomic() to set the FUTEX_WAITERS bit now,
* as we have means to handle the possible fault. If not, don't set
* the bit unnecessarily as it will force the subsequent unlock to enter
* the kernel.
*/
top_waiter = futex_top_waiter(hb1, key1);
/* There are no waiters, nothing for us to do. */
if (!top_waiter)
return 0;
/*
* Ensure that this is a waiter sitting in futex_wait_requeue_pi()
* and waiting on the 'waitqueue' futex which is always !PI.
*/
if (!top_waiter->rt_waiter || top_waiter->pi_state)
return -EINVAL;
/* Ensure we requeue to the expected futex. */
if (!futex_match(top_waiter->requeue_pi_key, key2))
return -EINVAL;
/* Ensure that this does not race against an early wakeup */
if (!futex_requeue_pi_prepare(top_waiter, NULL))
return -EAGAIN;
/*
* Try to take the lock for top_waiter and set the FUTEX_WAITERS bit
* in the contended case or if @set_waiters is true.
*
* In the contended case PI state is attached to the lock owner. If
* the user space lock can be acquired then PI state is attached to
* the new owner (@top_waiter->task) when @set_waiters is true.
*/
ret = futex_lock_pi_atomic(pifutex, hb2, key2, ps, top_waiter->task,
exiting, set_waiters);
if (ret == 1) {
/*
* Lock was acquired in user space and PI state was
* attached to @top_waiter->task. That means state is fully
* consistent and the waiter can return to user space
* immediately after the wakeup.
*/
requeue_pi_wake_futex(top_waiter, key2, hb2);
} else if (ret < 0) {
/* Rewind top_waiter::requeue_state */
futex_requeue_pi_complete(top_waiter, ret);
} else {
/*
* futex_lock_pi_atomic() did not acquire the user space
* futex, but managed to establish the proxy lock and pi
* state. top_waiter::requeue_state cannot be fixed up here
* because the waiter is not enqueued on the rtmutex
* yet. This is handled at the callsite depending on the
* result of rt_mutex_start_proxy_lock() which is
* guaranteed to be reached with this function returning 0.
*/
}
return ret;
}
/**
* futex_requeue() - Requeue waiters from uaddr1 to uaddr2
* @uaddr1: source futex user address
* @flags: futex flags (FLAGS_SHARED, etc.)
* @uaddr2: target futex user address
* @nr_wake: number of waiters to wake (must be 1 for requeue_pi)
* @nr_requeue: number of waiters to requeue (0-INT_MAX)
* @cmpval: @uaddr1 expected value (or %NULL)
* @requeue_pi: if we are attempting to requeue from a non-pi futex to a
* pi futex (pi to pi requeue is not supported)
*
* Requeue waiters on uaddr1 to uaddr2. In the requeue_pi case, try to acquire
* uaddr2 atomically on behalf of the top waiter.
*
* Return:
* - >=0 - on success, the number of tasks requeued or woken;
* - <0 - on error
*/
int futex_requeue(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
int nr_wake, int nr_requeue, u32 *cmpval, int requeue_pi)
{
union futex_key key1 = FUTEX_KEY_INIT, key2 = FUTEX_KEY_INIT;
int task_count = 0, ret;
struct futex_pi_state *pi_state = NULL;
struct futex_hash_bucket *hb1, *hb2;
struct futex_q *this, *next;
DEFINE_WAKE_Q(wake_q);
if (nr_wake < 0 || nr_requeue < 0)
return -EINVAL;
/*
* When PI is not supported: return -ENOSYS if requeue_pi is true.
* Consequently the compiler knows requeue_pi is always false past
* this point and can optimize away all the conditional code
* further down.
*/
if (!IS_ENABLED(CONFIG_FUTEX_PI) && requeue_pi)
return -ENOSYS;
if (requeue_pi) {
/*
* Requeue PI only works on two distinct uaddrs. This
* check is only valid for private futexes. See below.
*/
if (uaddr1 == uaddr2)
return -EINVAL;
/*
* futex_requeue() allows the caller to define the number
* of waiters to wake up via the @nr_wake argument. With
* REQUEUE_PI, waking up more than one waiter is creating
* more problems than it solves. Waking up a waiter only
* makes sense if the PI futex @uaddr2 is uncontended as
* this allows the requeue code to acquire the futex
* @uaddr2 before waking the waiter. The waiter can then
* return to user space without further action. A secondary
* wakeup would just make the futex_wait_requeue_pi()
* handling more complex, because that code would have to
* look up pi_state and do more or less all the handling
* which the requeue code has to do for the to be requeued
* waiters. So restrict the number of waiters to wake to
* one, and only wake it up when the PI futex is
* uncontended. Otherwise requeue it and let the unlock of
* the PI futex handle the wakeup.
*
* All REQUEUE_PI users, e.g. pthread_cond_signal() and
* pthread_cond_broadcast() must use nr_wake=1.
*/
if (nr_wake != 1)
return -EINVAL;
/*
* requeue_pi requires a pi_state, try to allocate it now
* without any locks in case it fails.
*/
if (refill_pi_state_cache())
return -ENOMEM;
}
retry:
ret = get_futex_key(uaddr1, flags & FLAGS_SHARED, &key1, FUTEX_READ);
if (unlikely(ret != 0))
return ret;
ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2,
requeue_pi ? FUTEX_WRITE : FUTEX_READ);
if (unlikely(ret != 0))
return ret;
/*
* The check above which compares uaddrs is not sufficient for
* shared futexes. We need to compare the keys:
*/
if (requeue_pi && futex_match(&key1, &key2))
return -EINVAL;
hb1 = futex_hash(&key1);
hb2 = futex_hash(&key2);
retry_private:
futex_hb_waiters_inc(hb2);
double_lock_hb(hb1, hb2);
if (likely(cmpval != NULL)) {
u32 curval;
ret = futex_get_value_locked(&curval, uaddr1);
if (unlikely(ret)) {
double_unlock_hb(hb1, hb2);
futex_hb_waiters_dec(hb2);
ret = get_user(curval, uaddr1);
if (ret)
return ret;
if (!(flags & FLAGS_SHARED))
goto retry_private;
goto retry;
}
if (curval != *cmpval) {
ret = -EAGAIN;
goto out_unlock;
}
}
if (requeue_pi) {
struct task_struct *exiting = NULL;
/*
* Attempt to acquire uaddr2 and wake the top waiter. If we
* intend to requeue waiters, force setting the FUTEX_WAITERS
* bit. We force this here where we are able to easily handle
* faults rather than in the requeue loop below.
*
* Updates top_waiter::requeue_state if a top waiter exists.
*/
ret = futex_proxy_trylock_atomic(uaddr2, hb1, hb2, &key1,
&key2, &pi_state,
&exiting, nr_requeue);
/*
* At this point the top_waiter has either taken uaddr2 or
* is waiting on it. In both cases pi_state has been
* established and an initial refcount is held on it. In case
* of an error there's nothing.
*
* The top waiter's requeue_state is up to date:
*
* - If the lock was acquired atomically (ret == 1), then
* the state is Q_REQUEUE_PI_LOCKED.
*
* The top waiter has been dequeued and woken up and can
* return to user space immediately. The kernel/user
* space state is consistent. If more waiters must be
* requeued, the WAITERS bit in the user space futex is
* set so the top waiter task has to go into the syscall
* slowpath to unlock the futex. This will block until
* this requeue operation has been completed and the hash
* bucket locks have been dropped.
*
* - If the trylock failed with an error (ret < 0) then
* the state is either Q_REQUEUE_PI_NONE, i.e. "nothing
* happened", or Q_REQUEUE_PI_IGNORE when there was an
* interleaved early wakeup.
*
* - If the trylock did not succeed (ret == 0) then the
* state is either Q_REQUEUE_PI_IN_PROGRESS or
* Q_REQUEUE_PI_WAIT if an early wakeup interleaved.
* This will be cleaned up in the loop below, which
* cannot fail because futex_proxy_trylock_atomic() did
* the same sanity checks for requeue_pi as the loop
* below does.
*/
switch (ret) {
case 0:
/* We hold a reference on the pi state. */
break;
case 1:
/*
* futex_proxy_trylock_atomic() acquired the user space
* futex. Adjust task_count.
*/
task_count++;
ret = 0;
break;
/*
* If the above failed, then pi_state is NULL and
* waiter::requeue_state is correct.
*/
case -EFAULT:
double_unlock_hb(hb1, hb2);
futex_hb_waiters_dec(hb2);
ret = fault_in_user_writeable(uaddr2);
if (!ret)
goto retry;
return ret;
case -EBUSY:
case -EAGAIN:
/*
* Two reasons for this:
* - EBUSY: Owner is exiting and we just wait for the
* exit to complete.
* - EAGAIN: The user space value changed.
*/
double_unlock_hb(hb1, hb2);
futex_hb_waiters_dec(hb2);
/*
* Handle the case where the owner is in the middle of
* exiting. Wait for the exit to complete otherwise
* this task might loop forever, i.e. live lock.
*/
wait_for_owner_exiting(ret, exiting);
cond_resched();
goto retry;
default:
goto out_unlock;
}
}
plist_for_each_entry_safe(this, next, &hb1->chain, list) {
if (task_count - nr_wake >= nr_requeue)
break;
if (!futex_match(&this->key, &key1))
continue;
/*
* FUTEX_WAIT_REQUEUE_PI and FUTEX_CMP_REQUEUE_PI should always
* be paired with each other and no other futex ops.
*
* We should never be requeueing a futex_q with a pi_state,
* which is awaiting a futex_unlock_pi().
*/
if ((requeue_pi && !this->rt_waiter) ||
(!requeue_pi && this->rt_waiter) ||
this->pi_state) {
ret = -EINVAL;
break;
}
/* Plain futexes just wake or requeue and are done */
if (!requeue_pi) {
if (++task_count <= nr_wake)
futex_wake_mark(&wake_q, this);
else
requeue_futex(this, hb1, hb2, &key2);
continue;
}
/* Ensure we requeue to the expected futex for requeue_pi. */
if (!futex_match(this->requeue_pi_key, &key2)) {
ret = -EINVAL;
break;
}
/*
* Requeue nr_requeue waiters and possibly one more in the case
* of requeue_pi if we couldn't acquire the lock atomically.
*
* Prepare the waiter to take the rt_mutex. Take a refcount
* on the pi_state and store the pointer in the futex_q
* object of the waiter.
*/
get_pi_state(pi_state);
/* Don't requeue when the waiter is already on the way out. */
if (!futex_requeue_pi_prepare(this, pi_state)) {
/*
* Early woken waiter signaled that it is on the
* way out. Drop the pi_state reference and try the
* next waiter. @this->pi_state is still NULL.
*/
put_pi_state(pi_state);
continue;
}
ret = rt_mutex_start_proxy_lock(&pi_state->pi_mutex,
this->rt_waiter,
this->task);
if (ret == 1) {
/*
* We got the lock. We neither drop the refcount
* on pi_state nor clear this->pi_state because the
* waiter needs the pi_state for cleaning up the
* user space value. It will drop the refcount
* after doing so. this::requeue_state is updated
* in the wakeup as well.
*/
requeue_pi_wake_futex(this, &key2, hb2);
task_count++;
} else if (!ret) {
/* Waiter is queued, move it to hb2 */
requeue_futex(this, hb1, hb2, &key2);
futex_requeue_pi_complete(this, 0);
task_count++;
} else {
/*
* rt_mutex_start_proxy_lock() detected a potential
* deadlock when we tried to queue that waiter.
* Drop the pi_state reference which we took above
* and remove the pointer to the state from the
* waiter's futex_q object.
*/
this->pi_state = NULL;
put_pi_state(pi_state);
futex_requeue_pi_complete(this, ret);
/*
* We stop queueing more waiters and let user space
* deal with the mess.
*/
break;
}
}
/*
* We took an extra initial reference to the pi_state in
* futex_proxy_trylock_atomic(). We need to drop it here again.
*/
put_pi_state(pi_state);
out_unlock:
double_unlock_hb(hb1, hb2);
wake_up_q(&wake_q);
futex_hb_waiters_dec(hb2);
return ret ? ret : task_count;
}
/**
* handle_early_requeue_pi_wakeup() - Handle early wakeup on the initial futex
* @hb: the hash_bucket futex_q was originally enqueued on
* @q: the futex_q woken while waiting to be requeued
* @timeout: the timeout associated with the wait (NULL if none)
*
* Determine the cause for the early wakeup.
*
* Return:
* -EWOULDBLOCK or -ETIMEDOUT or -ERESTARTNOINTR
*/
static inline
int handle_early_requeue_pi_wakeup(struct futex_hash_bucket *hb,
struct futex_q *q,
struct hrtimer_sleeper *timeout)
{
int ret;
/*
* With the hb lock held, we avoid races while we process the wakeup.
* We only need to hold hb (and not hb2) to ensure atomicity as the
* wakeup code can't change q.key from uaddr to uaddr2 if we hold hb.
* It can't be requeued from uaddr2 to something else since we don't
* support a PI aware source futex for requeue.
*/
WARN_ON_ONCE(&hb->lock != q->lock_ptr);
/*
* We were woken prior to requeue by a timeout or a signal.
* Unqueue the futex_q and determine which it was.
*/
plist_del(&q->list, &hb->chain);
futex_hb_waiters_dec(hb);
/* Handle spurious wakeups gracefully */
ret = -EWOULDBLOCK;
if (timeout && !timeout->task)
ret = -ETIMEDOUT;
else if (signal_pending(current))
ret = -ERESTARTNOINTR;
return ret;
}
/**
* futex_wait_requeue_pi() - Wait on uaddr and take uaddr2
* @uaddr: the futex we initially wait on (non-pi)
* @flags: futex flags (FLAGS_SHARED, FLAGS_CLOCKRT, etc.), they must be
* the same type, no requeueing from private to shared, etc.
* @val: the expected value of uaddr
* @abs_time: absolute timeout
* @bitset: 32 bit wakeup bitset set by userspace, defaults to all
* @uaddr2: the pi futex we will take prior to returning to user-space
*
* The caller will wait on uaddr and will be requeued by futex_requeue() to
* uaddr2 which must be PI aware and distinct from uaddr. Normal wakeup will wake
* on uaddr2 and complete the acquisition of the rt_mutex prior to returning to
* userspace. This ensures the rt_mutex maintains an owner when it has waiters;
* without one, the pi logic would not know which task to boost/deboost, if
* there was a need to.
*
* We call schedule in futex_wait_queue() when we enqueue and return there
* via the following--
* 1) wakeup on uaddr2 after an atomic lock acquisition by futex_requeue()
* 2) wakeup on uaddr2 after a requeue
* 3) signal
* 4) timeout
*
* If 3, cleanup and return -ERESTARTNOINTR.
*
* If 2, we may then block on trying to take the rt_mutex and return via:
* 5) successful lock
* 6) signal
* 7) timeout
* 8) other lock acquisition failure
*
* If 6, return -EWOULDBLOCK (restarting the syscall would do the same).
*
* If 4 or 7, we cleanup and return with -ETIMEDOUT.
*
* Return:
* - 0 - On success;
* - <0 - On error
*/
int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
u32 val, ktime_t *abs_time, u32 bitset,
u32 __user *uaddr2)
{
struct hrtimer_sleeper timeout, *to;
struct rt_mutex_waiter rt_waiter;
struct futex_hash_bucket *hb;
union futex_key key2 = FUTEX_KEY_INIT;
struct futex_q q = futex_q_init;
struct rt_mutex_base *pi_mutex;
int res, ret;
if (!IS_ENABLED(CONFIG_FUTEX_PI))
return -ENOSYS;
if (uaddr == uaddr2)
return -EINVAL;
if (!bitset)
return -EINVAL;
to = futex_setup_timer(abs_time, &timeout, flags,
current->timer_slack_ns);
/*
* The waiter is allocated on our stack, manipulated by the requeue
* code while we sleep on uaddr.
*/
rt_mutex_init_waiter(&rt_waiter);
ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2, FUTEX_WRITE);
if (unlikely(ret != 0))
goto out;
q.bitset = bitset;
q.rt_waiter = &rt_waiter;
q.requeue_pi_key = &key2;
/*
* Prepare to wait on uaddr. On success, it holds hb->lock and q
* is initialized.
*/
ret = futex_wait_setup(uaddr, val, flags, &q, &hb);
if (ret)
goto out;
/*
* The check above which compares uaddrs is not sufficient for
* shared futexes. We need to compare the keys:
*/
if (futex_match(&q.key, &key2)) {
futex_q_unlock(hb);
ret = -EINVAL;
goto out;
}
/* Queue the futex_q, drop the hb lock, wait for wakeup. */
futex_wait_queue(hb, &q, to);
switch (futex_requeue_pi_wakeup_sync(&q)) {
case Q_REQUEUE_PI_IGNORE:
/* The waiter is still on uaddr1 */
spin_lock(&hb->lock);
ret = handle_early_requeue_pi_wakeup(hb, &q, to);
spin_unlock(&hb->lock);
break;
case Q_REQUEUE_PI_LOCKED:
/* The requeue acquired the lock */
if (q.pi_state && (q.pi_state->owner != current)) {
spin_lock(q.lock_ptr);
ret = fixup_pi_owner(uaddr2, &q, true);
/*
* Drop the reference to the pi state which the
* requeue_pi() code acquired for us.
*/
put_pi_state(q.pi_state);
spin_unlock(q.lock_ptr);
/*
* Adjust the return value. It's either -EFAULT or
* success (1) but the caller expects 0 for success.
*/
ret = ret < 0 ? ret : 0;
}
break;
case Q_REQUEUE_PI_DONE:
/* Requeue completed. Current is 'pi_blocked_on' the rtmutex */
pi_mutex = &q.pi_state->pi_mutex;
ret = rt_mutex_wait_proxy_lock(pi_mutex, to, &rt_waiter);
/* Current is no longer pi_blocked_on */
spin_lock(q.lock_ptr);
if (ret && !rt_mutex_cleanup_proxy_lock(pi_mutex, &rt_waiter))
ret = 0;
debug_rt_mutex_free_waiter(&rt_waiter);
/*
* Fixup the pi_state owner and possibly acquire the lock if we
* haven't already.
*/
res = fixup_pi_owner(uaddr2, &q, !ret);
/*
* If fixup_pi_owner() returned an error, propagate that. If it
* acquired the lock, clear -ETIMEDOUT or -EINTR.
*/
if (res)
ret = (res < 0) ? res : 0;
futex_unqueue_pi(&q);
spin_unlock(q.lock_ptr);
if (ret == -EINTR) {
/*
* We've already been requeued, but cannot restart
* by calling futex_lock_pi() directly. We could
* restart this syscall, but it would detect that
* the user space "val" changed and return
* -EWOULDBLOCK. Save the overhead of the restart
* and return -EWOULDBLOCK directly.
*/
ret = -EWOULDBLOCK;
}
break;
default:
BUG();
}
out:
if (to) {
hrtimer_cancel(&to->timer);
destroy_hrtimer_on_stack(&to->timer);
}
return ret;
}
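
The requeue_state transitions documented at the top of requeue.c can be exercised in isolation. The following is a simplified userspace sketch under stated assumptions, not the kernel implementation: it uses C11 `<stdatomic.h>` in place of the kernel's atomic_t helpers, the function names `requeue_prepare()`, `requeue_complete()` and `waiter_mark_wakeup()` are hypothetical, and the waiter side deliberately omits the blocking step (rcuwait or atomic_cond_read_relaxed()) that futex_requeue_pi_wakeup_sync() performs after setting WAIT. Only the state values are taken from the source.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum {
	Q_REQUEUE_PI_NONE = 0,
	Q_REQUEUE_PI_IGNORE,
	Q_REQUEUE_PI_IN_PROGRESS,
	Q_REQUEUE_PI_WAIT,
	Q_REQUEUE_PI_DONE,
	Q_REQUEUE_PI_LOCKED,
};

/* Requeue side: claim the waiter unless it already signalled IGNORE. */
static bool requeue_prepare(atomic_int *state)
{
	int old = atomic_load_explicit(state, memory_order_acquire);

	do {
		if (old == Q_REQUEUE_PI_IGNORE)
			return false;
		if (old != Q_REQUEUE_PI_NONE)
			break;	/* already IN_PROGRESS or WAIT */
	} while (!atomic_compare_exchange_weak(state, &old,
					       Q_REQUEUE_PI_IN_PROGRESS));
	return true;
}

/*
 * Requeue side: publish the outcome. locked >= 0 means success (0 = DONE,
 * 1 = LOCKED); a negative value means the requeue failed and is rewound.
 */
static void requeue_complete(atomic_int *state, int locked)
{
	int old = atomic_load_explicit(state, memory_order_acquire);
	int next;

	do {
		if (old == Q_REQUEUE_PI_IGNORE)
			return;
		if (locked >= 0)
			next = Q_REQUEUE_PI_DONE + locked;
		else if (old == Q_REQUEUE_PI_IN_PROGRESS)
			next = Q_REQUEUE_PI_NONE;	/* no early wakeup */
		else
			next = Q_REQUEUE_PI_IGNORE;	/* early wakeup interleaved */
	} while (!atomic_compare_exchange_weak(state, &old, next));
}

/*
 * Waiter side: signal an early wakeup. Returns the state the waiter ends up
 * in. In this sketch the caller would still have to wait while it reads WAIT.
 */
static int waiter_mark_wakeup(atomic_int *state)
{
	int old = atomic_load_explicit(state, memory_order_acquire);
	int next;

	do {
		if (old >= Q_REQUEUE_PI_DONE)
			return old;	/* requeue already finished */
		next = (old == Q_REQUEUE_PI_NONE) ? Q_REQUEUE_PI_IGNORE
						  : Q_REQUEUE_PI_WAIT;
	} while (!atomic_compare_exchange_weak(state, &old, next));
	return next;
}
```

On a failed compare-exchange, `atomic_compare_exchange_weak()` writes the current value back into `old`, which mirrors how the kernel loops around atomic_try_cmpxchg() and re-evaluates the transition from the freshly observed state.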

398
kernel/futex/syscalls.c Normal file

@@ -0,0 +1,398 @@
// SPDX-License-Identifier: GPL-2.0-or-later
#include <linux/compat.h>
#include <linux/syscalls.h>
#include <linux/time_namespace.h>
#include "futex.h"
/*
* Support for robust futexes: the kernel cleans up held futexes at
* thread exit time.
*
* Implementation: user-space maintains a per-thread list of locks it
* is holding. Upon do_exit(), the kernel carefully walks this list,
* and marks all locks that are owned by this thread with the
* FUTEX_OWNER_DIED bit, and wakes up a waiter (if any). The list is
* always manipulated with the lock held, so the list is private and
* per-thread. Userspace also maintains a per-thread 'list_op_pending'
* field, to allow the kernel to clean up if the thread dies after
* acquiring the lock, but just before it could have added itself to
* the list. There can only be one such pending lock.
*/
/**
* sys_set_robust_list() - Set the robust-futex list head of a task
* @head: pointer to the list-head
* @len: length of the list-head, as userspace expects
*/
SYSCALL_DEFINE2(set_robust_list, struct robust_list_head __user *, head,
size_t, len)
{
if (!futex_cmpxchg_enabled)
return -ENOSYS;
/*
* The kernel knows only one size for now:
*/
if (unlikely(len != sizeof(*head)))
return -EINVAL;
current->robust_list = head;
return 0;
}
/**
* sys_get_robust_list() - Get the robust-futex list head of a task
* @pid: pid of the process [zero for current task]
* @head_ptr: pointer to a list-head pointer, the kernel fills it in
* @len_ptr: pointer to a length field, the kernel fills in the header size
*/
SYSCALL_DEFINE3(get_robust_list, int, pid,
struct robust_list_head __user * __user *, head_ptr,
size_t __user *, len_ptr)
{
struct robust_list_head __user *head;
unsigned long ret;
struct task_struct *p;
if (!futex_cmpxchg_enabled)
return -ENOSYS;
rcu_read_lock();
ret = -ESRCH;
if (!pid)
p = current;
else {
p = find_task_by_vpid(pid);
if (!p)
goto err_unlock;
}
ret = -EPERM;
if (!ptrace_may_access(p, PTRACE_MODE_READ_REALCREDS))
goto err_unlock;
head = p->robust_list;
rcu_read_unlock();
if (put_user(sizeof(*head), len_ptr))
return -EFAULT;
return put_user(head, head_ptr);
err_unlock:
rcu_read_unlock();
return ret;
}
long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
u32 __user *uaddr2, u32 val2, u32 val3)
{
int cmd = op & FUTEX_CMD_MASK;
unsigned int flags = 0;
if (!(op & FUTEX_PRIVATE_FLAG))
flags |= FLAGS_SHARED;
if (op & FUTEX_CLOCK_REALTIME) {
flags |= FLAGS_CLOCKRT;
if (cmd != FUTEX_WAIT_BITSET && cmd != FUTEX_WAIT_REQUEUE_PI &&
cmd != FUTEX_LOCK_PI2)
return -ENOSYS;
}
switch (cmd) {
case FUTEX_LOCK_PI:
case FUTEX_LOCK_PI2:
case FUTEX_UNLOCK_PI:
case FUTEX_TRYLOCK_PI:
case FUTEX_WAIT_REQUEUE_PI:
case FUTEX_CMP_REQUEUE_PI:
if (!futex_cmpxchg_enabled)
return -ENOSYS;
}
switch (cmd) {
case FUTEX_WAIT:
val3 = FUTEX_BITSET_MATCH_ANY;
fallthrough;
case FUTEX_WAIT_BITSET:
return futex_wait(uaddr, flags, val, timeout, val3);
case FUTEX_WAKE:
val3 = FUTEX_BITSET_MATCH_ANY;
fallthrough;
case FUTEX_WAKE_BITSET:
return futex_wake(uaddr, flags, val, val3);
case FUTEX_REQUEUE:
return futex_requeue(uaddr, flags, uaddr2, val, val2, NULL, 0);
case FUTEX_CMP_REQUEUE:
return futex_requeue(uaddr, flags, uaddr2, val, val2, &val3, 0);
case FUTEX_WAKE_OP:
return futex_wake_op(uaddr, flags, uaddr2, val, val2, val3);
case FUTEX_LOCK_PI:
flags |= FLAGS_CLOCKRT;
fallthrough;
case FUTEX_LOCK_PI2:
return futex_lock_pi(uaddr, flags, timeout, 0);
case FUTEX_UNLOCK_PI:
return futex_unlock_pi(uaddr, flags);
case FUTEX_TRYLOCK_PI:
return futex_lock_pi(uaddr, flags, NULL, 1);
case FUTEX_WAIT_REQUEUE_PI:
val3 = FUTEX_BITSET_MATCH_ANY;
return futex_wait_requeue_pi(uaddr, flags, val, timeout, val3,
uaddr2);
case FUTEX_CMP_REQUEUE_PI:
return futex_requeue(uaddr, flags, uaddr2, val, val2, &val3, 1);
}
return -ENOSYS;
}
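The first lines of do_futex() split the userspace `op` word into a command and two modifier bits. A minimal standalone sketch of that decoding follows. The `SK_`-prefixed names are hypothetical; the numeric values are copied from the futex UAPI header as an assumption, and the `SK_FLAGS_*` bits only stand in for the kernel-internal `FLAGS_*` values. The later step that forces FLAGS_CLOCKRT for FUTEX_LOCK_PI is not modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Values mirrored from the futex UAPI header so the sketch stands alone. */
enum {
	SK_FUTEX_WAIT            = 0,
	SK_FUTEX_WAIT_BITSET     = 9,
	SK_FUTEX_WAIT_REQUEUE_PI = 11,
	SK_FUTEX_LOCK_PI2        = 13,
	SK_FUTEX_PRIVATE_FLAG    = 128,
	SK_FUTEX_CLOCK_REALTIME  = 256,
	SK_FUTEX_CMD_MASK        = ~(128 | 256),
};

/* Hypothetical stand-ins for the kernel-internal FLAGS_* bits. */
enum { SK_FLAGS_SHARED = 0x01, SK_FLAGS_CLOCKRT = 0x02 };

struct sk_decoded {
	int cmd;            /* command with the modifier bits stripped */
	unsigned int flags; /* SK_FLAGS_* derived from the modifier bits */
	bool clock_ok;      /* false where do_futex() would return -ENOSYS */
};

static struct sk_decoded sk_decode_op(int op)
{
	struct sk_decoded d = {
		.cmd = op & SK_FUTEX_CMD_MASK, .flags = 0, .clock_ok = true,
	};

	/* The stripped command must never carry a modifier bit. */
	assert(!(d.cmd & (SK_FUTEX_PRIVATE_FLAG | SK_FUTEX_CLOCK_REALTIME)));

	/* Absence of the private flag means the futex may be shared. */
	if (!(op & SK_FUTEX_PRIVATE_FLAG))
		d.flags |= SK_FLAGS_SHARED;

	/* Only three commands accept the realtime clock modifier. */
	if (op & SK_FUTEX_CLOCK_REALTIME) {
		d.flags |= SK_FLAGS_CLOCKRT;
		d.clock_ok = d.cmd == SK_FUTEX_WAIT_BITSET ||
			     d.cmd == SK_FUTEX_WAIT_REQUEUE_PI ||
			     d.cmd == SK_FUTEX_LOCK_PI2;
	}
	return d;
}
```

Calling `sk_decode_op(SK_FUTEX_WAIT | SK_FUTEX_CLOCK_REALTIME)` yields `clock_ok == false`, matching the early -ENOSYS return for a plain FUTEX_WAIT with the realtime modifier.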
static __always_inline bool futex_cmd_has_timeout(u32 cmd)
{
switch (cmd) {
case FUTEX_WAIT:
case FUTEX_LOCK_PI:
case FUTEX_LOCK_PI2:
case FUTEX_WAIT_BITSET:
case FUTEX_WAIT_REQUEUE_PI:
return true;
}
return false;
}
static __always_inline int
futex_init_timeout(u32 cmd, u32 op, struct timespec64 *ts, ktime_t *t)
{
if (!timespec64_valid(ts))
return -EINVAL;
*t = timespec64_to_ktime(*ts);
if (cmd == FUTEX_WAIT)
*t = ktime_add_safe(ktime_get(), *t);
else if (cmd != FUTEX_LOCK_PI && !(op & FUTEX_CLOCK_REALTIME))
*t = timens_ktime_to_host(CLOCK_MONOTONIC, *t);
return 0;
}
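futex_init_timeout() treats the FUTEX_WAIT timeout as relative and turns it into an absolute deadline by adding the current time, while the other commands already pass an absolute value. A rough userspace model of that conversion, using plain nanosecond counters, might look like this. `sk_add_safe()` is a hypothetical stand-in written in the spirit of ktime_add_safe(): it is assumed to saturate instead of wrapping, and the time-namespace adjustment is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SK_KTIME_MAX INT64_MAX

/* Saturating add of two non-negative nanosecond values. */
static int64_t sk_add_safe(int64_t lhs, int64_t rhs)
{
	assert(lhs >= 0 && rhs >= 0);
	return (lhs > SK_KTIME_MAX - rhs) ? SK_KTIME_MAX : lhs + rhs;
}

/*
 * Compute the absolute deadline for a wait.
 * relative == true models FUTEX_WAIT: the timeout is added to "now".
 * relative == false models the absolute-timeout commands.
 */
static int64_t sk_deadline(bool relative, int64_t now_ns, int64_t timeout_ns)
{
	return relative ? sk_add_safe(now_ns, timeout_ns) : timeout_ns;
}
```

Saturating keeps a huge relative timeout from turning into a deadline in the past.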
SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
const struct __kernel_timespec __user *, utime,
u32 __user *, uaddr2, u32, val3)
{
int ret, cmd = op & FUTEX_CMD_MASK;
ktime_t t, *tp = NULL;
struct timespec64 ts;
if (utime && futex_cmd_has_timeout(cmd)) {
if (unlikely(should_fail_futex(!(op & FUTEX_PRIVATE_FLAG))))
return -EFAULT;
if (get_timespec64(&ts, utime))
return -EFAULT;
ret = futex_init_timeout(cmd, op, &ts, &t);
if (ret)
return ret;
tp = &t;
}
return do_futex(uaddr, op, val, tp, uaddr2, (unsigned long)utime, val3);
}
/* Mask of available flags for each futex in futex_waitv list */
#define FUTEXV_WAITER_MASK (FUTEX_32 | FUTEX_PRIVATE_FLAG)
/**
* futex_parse_waitv - Parse a waitv array from userspace
* @futexv: Kernel side list of waiters to be filled
* @uwaitv: Userspace list to be parsed
* @nr_futexes: Length of futexv
*
* Return: Error code on failure, 0 on success
*/
static int futex_parse_waitv(struct futex_vector *futexv,
struct futex_waitv __user *uwaitv,
unsigned int nr_futexes)
{
struct futex_waitv aux;
unsigned int i;
for (i = 0; i < nr_futexes; i++) {
if (copy_from_user(&aux, &uwaitv[i], sizeof(aux)))
return -EFAULT;
if ((aux.flags & ~FUTEXV_WAITER_MASK) || aux.__reserved)
return -EINVAL;
if (!(aux.flags & FUTEX_32))
return -EINVAL;
futexv[i].w.flags = aux.flags;
futexv[i].w.val = aux.val;
futexv[i].w.uaddr = aux.uaddr;
futexv[i].q = futex_q_init;
}
return 0;
}
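The per-waiter checks in futex_parse_waitv() can be tried out in userspace with a small model. This is only a sketch: `struct sk_waitv` is a hypothetical copy of the `struct futex_waitv` layout, and the flag values (2 for FUTEX_32, 128 for FUTEX_PRIVATE_FLAG) are assumed from the UAPI header. The copy from userspace and the futex_q setup are omitted.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SK_FUTEX_32           2u
#define SK_FUTEX_PRIVATE_FLAG 128u
#define SK_WAITER_MASK        (SK_FUTEX_32 | SK_FUTEX_PRIVATE_FLAG)

/* Hypothetical mirror of the userspace waiter description. */
struct sk_waitv {
	uint64_t val;      /* expected value at uaddr */
	uint64_t uaddr;    /* address of the futex word */
	uint32_t flags;    /* size and private/shared flags */
	uint32_t reserved; /* must be zero */
};

/* Return 0 if every entry is acceptable, -EINVAL otherwise. */
static int sk_check_waitv(const struct sk_waitv *w, size_t n)
{
	assert(w != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		/* Unknown flag bits or a non-zero reserved field are rejected. */
		if ((w[i].flags & ~SK_WAITER_MASK) || w[i].reserved)
			return -EINVAL;
		/* Only 32-bit futexes are supported, and the size is mandatory. */
		if (!(w[i].flags & SK_FUTEX_32))
			return -EINVAL;
	}
	return 0;
}
```

Requiring an explicit size flag on every waiter leaves room for other futex sizes later without changing the meaning of existing callers.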
/**
* sys_futex_waitv - Wait on a list of futexes
* @waiters: List of futexes to wait on
* @nr_futexes: Length of @waiters
* @flags: Reserved for future use; must be 0 for now
* @timeout: Optional absolute timeout.
* @clockid: Clock to be used for the timeout, realtime or monotonic.
*
* Given an array of `struct futex_waitv`, wait on each uaddr. The thread wakes
* if a futex_wake() is performed at any uaddr. The syscall returns immediately
* if any waiter has *uaddr != val. @timeout is an optional absolute timeout for
* the operation. Each waiter has individual flags. The `flags` argument of
* the syscall is reserved and must currently be zero; the clock used for the
* timeout is selected with @clockid. Flags for private futexes, sizes, etc.
* should be set in the individual flags of each waiter.
*
* Returns the array index of one of the woken futexes. No further information
* is provided: any number of other futexes may also have been woken by the
* same event, and if more than one futex was woken, the returned index may
* refer to any one of them. (It is not necessarily the futex with the
* smallest index, nor the one most recently woken, nor...)
*/
SYSCALL_DEFINE5(futex_waitv, struct futex_waitv __user *, waiters,
unsigned int, nr_futexes, unsigned int, flags,
struct __kernel_timespec __user *, timeout, clockid_t, clockid)
{
struct hrtimer_sleeper to;
struct futex_vector *futexv;
struct timespec64 ts;
ktime_t time;
int ret;
/* This syscall supports no flags for now */
if (flags)
return -EINVAL;
if (!nr_futexes || nr_futexes > FUTEX_WAITV_MAX || !waiters)
return -EINVAL;
if (timeout) {
int flag_clkid = 0, flag_init = 0;
if (clockid == CLOCK_REALTIME) {
flag_clkid = FLAGS_CLOCKRT;
flag_init = FUTEX_CLOCK_REALTIME;
}
if (clockid != CLOCK_REALTIME && clockid != CLOCK_MONOTONIC)
return -EINVAL;
if (get_timespec64(&ts, timeout))
return -EFAULT;
/*
* Since there's no opcode for futex_waitv, use
* FUTEX_WAIT_BITSET that uses absolute timeout as well
*/
ret = futex_init_timeout(FUTEX_WAIT_BITSET, flag_init, &ts, &time);
if (ret)
return ret;
futex_setup_timer(&time, &to, flag_clkid, 0);
}
futexv = kcalloc(nr_futexes, sizeof(*futexv), GFP_KERNEL);
if (!futexv)
return -ENOMEM;
ret = futex_parse_waitv(futexv, waiters, nr_futexes);
if (!ret)
ret = futex_wait_multiple(futexv, nr_futexes, timeout ? &to : NULL);
if (timeout) {
hrtimer_cancel(&to.timer);
destroy_hrtimer_on_stack(&to.timer);
}
kfree(futexv);
return ret;
}
#ifdef CONFIG_COMPAT
COMPAT_SYSCALL_DEFINE2(set_robust_list,
struct compat_robust_list_head __user *, head,
compat_size_t, len)
{
if (!futex_cmpxchg_enabled)
return -ENOSYS;
if (unlikely(len != sizeof(*head)))
return -EINVAL;
current->compat_robust_list = head;
return 0;
}
COMPAT_SYSCALL_DEFINE3(get_robust_list, int, pid,
compat_uptr_t __user *, head_ptr,
compat_size_t __user *, len_ptr)
{
struct compat_robust_list_head __user *head;
unsigned long ret;
struct task_struct *p;
if (!futex_cmpxchg_enabled)
return -ENOSYS;
rcu_read_lock();
ret = -ESRCH;
if (!pid)
p = current;
else {
p = find_task_by_vpid(pid);
if (!p)
goto err_unlock;
}
ret = -EPERM;
if (!ptrace_may_access(p, PTRACE_MODE_READ_REALCREDS))
goto err_unlock;
head = p->compat_robust_list;
rcu_read_unlock();
if (put_user(sizeof(*head), len_ptr))
return -EFAULT;
return put_user(ptr_to_compat(head), head_ptr);
err_unlock:
rcu_read_unlock();
return ret;
}
#endif /* CONFIG_COMPAT */
#ifdef CONFIG_COMPAT_32BIT_TIME
SYSCALL_DEFINE6(futex_time32, u32 __user *, uaddr, int, op, u32, val,
const struct old_timespec32 __user *, utime, u32 __user *, uaddr2,
u32, val3)
{
int ret, cmd = op & FUTEX_CMD_MASK;
ktime_t t, *tp = NULL;
struct timespec64 ts;
if (utime && futex_cmd_has_timeout(cmd)) {
if (get_old_timespec32(&ts, utime))
return -EFAULT;
ret = futex_init_timeout(cmd, op, &ts, &t);
if (ret)
return ret;
tp = &t;
}
return do_futex(uaddr, op, val, tp, uaddr2, (unsigned long)utime, val3);
}
#endif /* CONFIG_COMPAT_32BIT_TIME */

kernel/futex/waitwake.c Normal file

@ -0,0 +1,708 @@
// SPDX-License-Identifier: GPL-2.0-or-later
#include <linux/sched/task.h>
#include <linux/sched/signal.h>
#include <linux/freezer.h>
#include "futex.h"
/*
* READ this before attempting to hack on futexes!
*
* Basic futex operation and ordering guarantees
* =============================================
*
* The waiter reads the futex value in user space and calls
* futex_wait(). This function computes the hash bucket and acquires
* the hash bucket lock. After that it reads the futex user space value
* again and verifies that the data has not changed. If it has not changed
* it enqueues itself into the hash bucket, releases the hash bucket lock
* and schedules.
*
* The waker side modifies the user space value of the futex and calls
* futex_wake(). This function computes the hash bucket and acquires the
* hash bucket lock. Then it looks for waiters on that futex in the hash
* bucket and wakes them.
*
* In futex wake up scenarios where no tasks are blocked on a futex, the waker
* can avoid taking the hb spinlock and simply return. In order for this
* optimization to work, ordering guarantees must exist so that the waiter
* being added to the list is acknowledged when the list is concurrently being
* checked by the waker, avoiding scenarios like the following:
*
* CPU 0 CPU 1
* val = *futex;
* sys_futex(WAIT, futex, val);
* futex_wait(futex, val);
* uval = *futex;
* *futex = newval;
* sys_futex(WAKE, futex);
* futex_wake(futex);
* if (queue_empty())
* return;
* if (uval == val)
* lock(hash_bucket(futex));
* queue();
* unlock(hash_bucket(futex));
* schedule();
*
* This would cause the waiter on CPU 0 to wait forever because it
* missed the transition of the user space value from val to newval
* and the waker did not find the waiter in the hash bucket queue.
*
* The correct serialization ensures that a waiter either observes
* the changed user space value before blocking or is woken by a
* concurrent waker:
*
* CPU 0 CPU 1
* val = *futex;
* sys_futex(WAIT, futex, val);
* futex_wait(futex, val);
*
* waiters++; (a)
* smp_mb(); (A) <-- paired with -.
* |
* lock(hash_bucket(futex)); |
* |
* uval = *futex; |
* | *futex = newval;
* | sys_futex(WAKE, futex);
* | futex_wake(futex);
* |
* `--------> smp_mb(); (B)
* if (uval == val)
* queue();
* unlock(hash_bucket(futex));
* schedule(); if (waiters)
* lock(hash_bucket(futex));
* else wake_waiters(futex);
* waiters--; (b) unlock(hash_bucket(futex));
*
* Where (A) orders the waiters increment and the futex value read through
* atomic operations (see futex_hb_waiters_inc()) and where (B) orders the write
* to futex and the waiters read (see futex_hb_waiters_pending()).
*
* This yields the following case (where X:=waiters, Y:=futex):
*
* X = Y = 0
*
* w[X]=1 w[Y]=1
* MB MB
* r[Y]=y r[X]=x
*
* Which guarantees that x==0 && y==0 is impossible; which translates back into
* the guarantee that we cannot both miss the futex variable change and the
* enqueue.
*
* Note that a new waiter is accounted for in (a) even when it is possible that
* the wait call can return error, in which case we backtrack from it in (b).
* Refer to the comment in futex_q_lock().
*
* Similarly, in order to account for waiters being requeued on another
* address we always increment the waiters for the destination bucket before
* acquiring the lock. It then decrements them again after releasing it -
* the code that actually moves the futex(es) between hash buckets (requeue_futex)
* will do the additional required waiter count housekeeping. This is done for
* double_lock_hb() and double_unlock_hb(), respectively.
*/
/*
* The hash bucket lock must be held when this is called.
* Afterwards, the futex_q must not be accessed. Callers
* must ensure to later call wake_up_q() for the actual
* wakeups to occur.
*/
void futex_wake_mark(struct wake_q_head *wake_q, struct futex_q *q)
{
struct task_struct *p = q->task;
if (WARN(q->pi_state || q->rt_waiter, "refusing to wake PI futex\n"))
return;
get_task_struct(p);
__futex_unqueue(q);
/*
* The waiting task can free the futex_q as soon as q->lock_ptr = NULL
* is written, without taking any locks. This is possible in the event
* of a spurious wakeup, for example. A memory barrier is required here
* to prevent the following store to lock_ptr from getting ahead of the
* plist_del in __futex_unqueue().
*/
smp_store_release(&q->lock_ptr, NULL);
/*
* Queue the task for later wakeup, after we've released
* the hb->lock.
*/
wake_q_add_safe(wake_q, p);
}
/*
* Wake up waiters matching bitset queued on this futex (uaddr).
*/
int futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
{
struct futex_hash_bucket *hb;
struct futex_q *this, *next;
union futex_key key = FUTEX_KEY_INIT;
int ret;
DEFINE_WAKE_Q(wake_q);
if (!bitset)
return -EINVAL;
ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key, FUTEX_READ);
if (unlikely(ret != 0))
return ret;
hb = futex_hash(&key);
/* Make sure we really have tasks to wakeup */
if (!futex_hb_waiters_pending(hb))
return ret;
spin_lock(&hb->lock);
plist_for_each_entry_safe(this, next, &hb->chain, list) {
if (futex_match (&this->key, &key)) {
if (this->pi_state || this->rt_waiter) {
ret = -EINVAL;
break;
}
/* Check if one of the bits is set in both bitsets */
if (!(this->bitset & bitset))
continue;
futex_wake_mark(&wake_q, this);
if (++ret >= nr_wake)
break;
}
}
spin_unlock(&hb->lock);
wake_up_q(&wake_q);
return ret;
}
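The selection loop in futex_wake() wakes waiters in queue order, skips those whose bitset shares no bit with the waker's, and stops after nr_wake wakeups. A simplified array-based model, with a hypothetical `struct sk_waiter` in place of the futex_q chain and a `woken` flag in place of removal from the hash bucket:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a queued futex_q. */
struct sk_waiter {
	uint32_t bitset; /* bits this waiter is interested in */
	int woken;       /* set once the waiter has been "woken" */
};

/* Returns the number of waiters woken, or -EINVAL for an empty bitset. */
static int sk_wake(struct sk_waiter *q, size_t n, int nr_wake, uint32_t bitset)
{
	int woken = 0;

	assert(q != NULL || n == 0);
	if (!bitset)
		return -EINVAL;

	for (size_t i = 0; i < n; i++) {
		/* Skip waiters already gone or with no bit in common. */
		if (q[i].woken || !(q[i].bitset & bitset))
			continue;
		q[i].woken = 1;
		/* Same stop condition as the loop above. */
		if (++woken >= nr_wake)
			break;
	}
	return woken;
}
```

Because the limit is checked after the increment, a matching waiter is woken even when `nr_wake` is 0; the sketch keeps that behaviour.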
static int futex_atomic_op_inuser(unsigned int encoded_op, u32 __user *uaddr)
{
unsigned int op = (encoded_op & 0x70000000) >> 28;
unsigned int cmp = (encoded_op & 0x0f000000) >> 24;
int oparg = sign_extend32((encoded_op & 0x00fff000) >> 12, 11);
int cmparg = sign_extend32(encoded_op & 0x00000fff, 11);
int oldval, ret;
if (encoded_op & (FUTEX_OP_OPARG_SHIFT << 28)) {
if (oparg < 0 || oparg > 31) {
char comm[sizeof(current->comm)];
/*
* kill this print and return -EINVAL when userspace
* is sane again
*/
pr_info_ratelimited("futex_wake_op: %s tries to shift op by %d; fix this program\n",
get_task_comm(comm, current), oparg);
oparg &= 31;
}
oparg = 1 << oparg;
}
pagefault_disable();
ret = arch_futex_atomic_op_inuser(op, oparg, &oldval, uaddr);
pagefault_enable();
if (ret)
return ret;
switch (cmp) {
case FUTEX_OP_CMP_EQ:
return oldval == cmparg;
case FUTEX_OP_CMP_NE:
return oldval != cmparg;
case FUTEX_OP_CMP_LT:
return oldval < cmparg;
case FUTEX_OP_CMP_GE:
return oldval >= cmparg;
case FUTEX_OP_CMP_LE:
return oldval <= cmparg;
case FUTEX_OP_CMP_GT:
return oldval > cmparg;
default:
return -ENOSYS;
}
}
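The bit layout that futex_atomic_op_inuser() unpacks can be checked with a standalone encoder and decoder. The sketch below is an illustration under assumptions: `sk_encode()` follows the layout implied by the masks above, the value 8 for FUTEX_OP_OPARG_SHIFT is taken from the UAPI header, and an out-of-range shift is masked silently instead of being logged.

```c
#include <assert.h>
#include <stdint.h>

#define SK_FUTEX_OP_OPARG_SHIFT 8u /* assumed value of FUTEX_OP_OPARG_SHIFT */

struct sk_wake_op {
	unsigned int op;  /* operation, 3 bits */
	unsigned int cmp; /* comparison, 4 bits */
	int oparg;        /* operand, after optional 1 << oparg */
	int cmparg;       /* comparison argument */
};

/* Sign-extend a 12-bit field to a full int. */
static int32_t sk_sign_extend12(uint32_t v)
{
	v &= 0xfffu;
	return (v & 0x800u) ? (int32_t)v - 0x1000 : (int32_t)v;
}

/* Pack the four fields; both arguments must fit in 12 signed bits. */
static uint32_t sk_encode(unsigned int op, unsigned int cmp, int oparg, int cmparg)
{
	assert(oparg >= -2048 && oparg <= 2047);
	assert(cmparg >= -2048 && cmparg <= 2047);
	return ((op & 0xfu) << 28) | ((cmp & 0xfu) << 24) |
	       (((uint32_t)oparg & 0xfffu) << 12) | ((uint32_t)cmparg & 0xfffu);
}

static struct sk_wake_op sk_decode(uint32_t encoded)
{
	struct sk_wake_op d;

	d.op = (encoded & 0x70000000u) >> 28;
	d.cmp = (encoded & 0x0f000000u) >> 24;
	d.oparg = sk_sign_extend12((encoded & 0x00fff000u) >> 12);
	d.cmparg = sk_sign_extend12(encoded & 0x00000fffu);

	/* Top bit set: the operand is a shift count, use 1 << oparg. */
	if (encoded & (SK_FUTEX_OP_OPARG_SHIFT << 28))
		d.oparg = (int)(1u << ((unsigned int)d.oparg & 31u));
	return d;
}
```

For example, `sk_decode(sk_encode(1, 4, -1, 5))` gives back op 1, cmp 4, oparg -1 and cmparg 5, which shows why both arguments need sign extension from bit 11.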
/*
* Wake up all waiters hashed on the physical page that is mapped
* to this virtual address:
*/
int futex_wake_op(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
int nr_wake, int nr_wake2, int op)
{
union futex_key key1 = FUTEX_KEY_INIT, key2 = FUTEX_KEY_INIT;
struct futex_hash_bucket *hb1, *hb2;
struct futex_q *this, *next;
int ret, op_ret;
DEFINE_WAKE_Q(wake_q);
retry:
ret = get_futex_key(uaddr1, flags & FLAGS_SHARED, &key1, FUTEX_READ);
if (unlikely(ret != 0))
return ret;
ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2, FUTEX_WRITE);
if (unlikely(ret != 0))
return ret;
hb1 = futex_hash(&key1);
hb2 = futex_hash(&key2);
retry_private:
double_lock_hb(hb1, hb2);
op_ret = futex_atomic_op_inuser(op, uaddr2);
if (unlikely(op_ret < 0)) {
double_unlock_hb(hb1, hb2);
if (!IS_ENABLED(CONFIG_MMU) ||
unlikely(op_ret != -EFAULT && op_ret != -EAGAIN)) {
/*
* we don't get EFAULT from MMU faults if we don't have
* an MMU, but we might get them from range checking
*/
ret = op_ret;
return ret;
}
if (op_ret == -EFAULT) {
ret = fault_in_user_writeable(uaddr2);
if (ret)
return ret;
}
cond_resched();
if (!(flags & FLAGS_SHARED))
goto retry_private;
goto retry;
}
plist_for_each_entry_safe(this, next, &hb1->chain, list) {
if (futex_match (&this->key, &key1)) {
if (this->pi_state || this->rt_waiter) {
ret = -EINVAL;
goto out_unlock;
}
futex_wake_mark(&wake_q, this);
if (++ret >= nr_wake)
break;
}
}
if (op_ret > 0) {
op_ret = 0;
plist_for_each_entry_safe(this, next, &hb2->chain, list) {
if (futex_match (&this->key, &key2)) {
if (this->pi_state || this->rt_waiter) {
ret = -EINVAL;
goto out_unlock;
}
futex_wake_mark(&wake_q, this);
if (++op_ret >= nr_wake2)
break;
}
}
ret += op_ret;
}
out_unlock:
double_unlock_hb(hb1, hb2);
wake_up_q(&wake_q);
return ret;
}
static long futex_wait_restart(struct restart_block *restart);
/**
* futex_wait_queue() - futex_queue() and wait for wakeup, timeout, or signal
* @hb: the futex hash bucket, must be locked by the caller
* @q: the futex_q to queue up on
* @timeout: the prepared hrtimer_sleeper, or null for no timeout
*/
void futex_wait_queue(struct futex_hash_bucket *hb, struct futex_q *q,
struct hrtimer_sleeper *timeout)
{
/*
* The task state is guaranteed to be set before another task can
* wake it. set_current_state() is implemented using smp_store_mb() and
* futex_queue() calls spin_unlock() upon completion, both serializing
* access to the hash list and forcing another memory barrier.
*/
set_current_state(TASK_INTERRUPTIBLE);
futex_queue(q, hb);
/* Arm the timer */
if (timeout)
hrtimer_sleeper_start_expires(timeout, HRTIMER_MODE_ABS);
/*
* If we have been removed from the hash list, then another task
* has tried to wake us, and we can skip the call to schedule().
*/
if (likely(!plist_node_empty(&q->list))) {
/*
* If the timer has already expired, current will already be
* flagged for rescheduling. Only call schedule if there
* is no timeout, or if it has yet to expire.
*/
if (!timeout || timeout->task)
freezable_schedule();
}
__set_current_state(TASK_RUNNING);
}
/**
* unqueue_multiple - Remove various futexes from their hash bucket
* @v: The list of futexes to unqueue
* @count: Number of futexes in the list
*
* Helper to unqueue a list of futexes. This can't fail.
*
* Return:
* - >=0 - Index of the last futex that was awoken;
* - -1 - No futex was awoken
*/
static int unqueue_multiple(struct futex_vector *v, int count)
{
int ret = -1, i;
for (i = 0; i < count; i++) {
if (!futex_unqueue(&v[i].q))
ret = i;
}
return ret;
}
/**
* futex_wait_multiple_setup - Prepare to wait and enqueue multiple futexes
* @vs: The futex list to wait on
* @count: The size of the list
* @woken: Index of the last woken futex, if any. Used to notify the
* caller that it can return this index to userspace (return parameter)
*
* Prepare multiple futexes in a single step and enqueue them. This may fail if
* the futex list is invalid or if any futex was already awoken. On success the
* task is ready for interruptible sleep.
*
* Return:
* - 1 - One of the futexes was woken by another thread
* - 0 - Success
* - <0 - -EFAULT, -EWOULDBLOCK or -EINVAL
*/
static int futex_wait_multiple_setup(struct futex_vector *vs, int count, int *woken)
{
struct futex_hash_bucket *hb;
bool retry = false;
int ret, i;
u32 uval;
/*
* Enqueuing multiple futexes is tricky, because we need to enqueue
* each futex on the list before dealing with the next one to avoid
* deadlocking on the hash bucket. But, before enqueuing, we need to
* make sure that current->state is TASK_INTERRUPTIBLE, so we don't
* lose any wake events, which cannot be done before the get_futex_key
* of the next key, because it calls get_user_pages, which can sleep.
* Thus, we fetch the list of futex keys in two steps, by first
* pinning all the memory keys in the futex key, and only then reading
* each key and queuing the corresponding futex.
*
* Private futexes don't need to recalculate the hash on retry, so skip
* get_futex_key() when retrying.
*/
retry:
for (i = 0; i < count; i++) {
if ((vs[i].w.flags & FUTEX_PRIVATE_FLAG) && retry)
continue;
ret = get_futex_key(u64_to_user_ptr(vs[i].w.uaddr),
!(vs[i].w.flags & FUTEX_PRIVATE_FLAG),
&vs[i].q.key, FUTEX_READ);
if (unlikely(ret))
return ret;
}
set_current_state(TASK_INTERRUPTIBLE);
for (i = 0; i < count; i++) {
u32 __user *uaddr = (u32 __user *)(unsigned long)vs[i].w.uaddr;
struct futex_q *q = &vs[i].q;
u32 val = (u32)vs[i].w.val;
hb = futex_q_lock(q);
ret = futex_get_value_locked(&uval, uaddr);
if (!ret && uval == val) {
/*
* The bucket lock can't be held while dealing with the
* next futex. Queue each futex at this moment so hb can
* be unlocked.
*/
futex_queue(q, hb);
continue;
}
futex_q_unlock(hb);
__set_current_state(TASK_RUNNING);
/*
* Even if something went wrong, if we find out that a futex
* was woken, we don't return error and return this index to
* userspace
*/
*woken = unqueue_multiple(vs, i);
if (*woken >= 0)
return 1;
if (ret) {
/*
* If we need to handle a page fault, we need to do so
* without any lock and any enqueued futex (otherwise
* we could lose some wakeup). So we do it here, after
* undoing all the work done so far. On success, we
* retry all the work.
*/
if (get_user(uval, uaddr))
return -EFAULT;
retry = true;
goto retry;
}
if (uval != val)
return -EWOULDBLOCK;
}
return 0;
}
/**
* futex_sleep_multiple - Check sleeping conditions and sleep
* @vs: List of futexes to wait for
* @count: Length of vs
* @to: Timeout
*
* Sleep if and only if the timeout hasn't expired and no futex on the list has
* been woken up.
*/
static void futex_sleep_multiple(struct futex_vector *vs, unsigned int count,
struct hrtimer_sleeper *to)
{
if (to && !to->task)
return;
for (; count; count--, vs++) {
if (!READ_ONCE(vs->q.lock_ptr))
return;
}
freezable_schedule();
}
/**
* futex_wait_multiple - Prepare to wait on and enqueue several futexes
* @vs: The list of futexes to wait on
* @count: The number of objects
* @to: Timeout before giving up and returning to userspace
*
* Entry point for the futex_waitv() syscall. This function sleeps on a
* group of futexes and returns on the first futex that is woken, or
* after the timeout has elapsed.
*
* Return:
* - >=0 - Hint to the futex that was awoken
* - <0 - On error
*/
int futex_wait_multiple(struct futex_vector *vs, unsigned int count,
struct hrtimer_sleeper *to)
{
int ret, hint = 0;
if (to)
hrtimer_sleeper_start_expires(to, HRTIMER_MODE_ABS);
while (1) {
ret = futex_wait_multiple_setup(vs, count, &hint);
if (ret) {
if (ret > 0) {
/* A futex was woken during setup */
ret = hint;
}
return ret;
}
futex_sleep_multiple(vs, count, to);
__set_current_state(TASK_RUNNING);
ret = unqueue_multiple(vs, count);
if (ret >= 0)
return ret;
if (to && !to->task)
return -ETIMEDOUT;
else if (signal_pending(current))
return -ERESTARTSYS;
/*
* The final case is a spurious wakeup, for
* which just retry.
*/
}
}
/**
* futex_wait_setup() - Prepare to wait on a futex
* @uaddr: the futex userspace address
* @val: the expected value
* @flags: futex flags (FLAGS_SHARED, etc.)
* @q: the associated futex_q
* @hb: storage for hash_bucket pointer to be returned to caller
*
* Setup the futex_q and locate the hash_bucket. Get the futex value and
* compare it with the expected value. Handle atomic faults internally.
* Return with the hb lock held on success, and unlocked on failure.
*
* Return:
* - 0 - uaddr contains val and hb has been locked;
* - <0 - -EFAULT or -EWOULDBLOCK (uaddr does not contain val) and hb is unlocked
*/
int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
struct futex_q *q, struct futex_hash_bucket **hb)
{
u32 uval;
int ret;
/*
* Access the page AFTER the hash-bucket is locked.
* Order is important:
*
* Userspace waiter: val = var; if (cond(val)) futex_wait(&var, val);
* Userspace waker: if (cond(var)) { var = new; futex_wake(&var); }
*
* The basic logical guarantee of a futex is that it blocks ONLY
* if cond(var) is known to be true at the time of blocking, for
* any cond. If we locked the hash-bucket after testing *uaddr, that
* would open a race condition where we could block indefinitely with
* cond(var) false, which would violate the guarantee.
*
* On the other hand, we insert q and release the hash-bucket only
* after testing *uaddr. This guarantees that futex_wait() will NOT
* absorb a wakeup if *uaddr does not match the desired values
* while the syscall executes.
*/
retry:
ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &q->key, FUTEX_READ);
if (unlikely(ret != 0))
return ret;
retry_private:
*hb = futex_q_lock(q);
ret = futex_get_value_locked(&uval, uaddr);
if (ret) {
futex_q_unlock(*hb);
ret = get_user(uval, uaddr);
if (ret)
return ret;
if (!(flags & FLAGS_SHARED))
goto retry_private;
goto retry;
}
if (uval != val) {
futex_q_unlock(*hb);
ret = -EWOULDBLOCK;
}
return ret;
}
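The value check in futex_wait_setup() is what a userspace waiter sees as EAGAIN (the same number as EWOULDBLOCK on Linux). A minimal sketch using the raw syscall, with hypothetical wrapper names `sk_futex_wait()` and `sk_futex_wake()`, assuming a Linux system where `SYS_futex` and `<linux/futex.h>` are available:

```c
#include <assert.h>
#include <errno.h>
#include <linux/futex.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Block while *word == expected (no timeout).
 * Returns 0 when woken, or a negative errno value; -EAGAIN means the
 * word did not hold the expected value when the kernel looked at it.
 */
static int sk_futex_wait(uint32_t *word, uint32_t expected)
{
	assert(word != NULL);
	long rc = syscall(SYS_futex, word, FUTEX_WAIT | FUTEX_PRIVATE_FLAG,
			  expected, NULL, NULL, 0);
	return rc == -1 ? -errno : (int)rc;
}

/* Wake up to n waiters; returns how many were woken or a negative errno. */
static int sk_futex_wake(uint32_t *word, int n)
{
	assert(word != NULL);
	long rc = syscall(SYS_futex, word, FUTEX_WAKE | FUTEX_PRIVATE_FLAG,
			  n, NULL, NULL, 0);
	return rc == -1 ? -errno : (int)rc;
}
```

A waiter typically loops: read the word, test the condition, call `sk_futex_wait()` with the value it read, and re-check on any return, which also covers spurious wakeups.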
int futex_wait(u32 __user *uaddr, unsigned int flags, u32 val, ktime_t *abs_time, u32 bitset)
{
struct hrtimer_sleeper timeout, *to;
struct restart_block *restart;
struct futex_hash_bucket *hb;
struct futex_q q = futex_q_init;
int ret;
if (!bitset)
return -EINVAL;
q.bitset = bitset;
to = futex_setup_timer(abs_time, &timeout, flags,
current->timer_slack_ns);
retry:
/*
* Prepare to wait on uaddr. On success, it holds hb->lock and q
* is initialized.
*/
ret = futex_wait_setup(uaddr, val, flags, &q, &hb);
if (ret)
goto out;
/* futex_queue and wait for wakeup, timeout, or a signal. */
futex_wait_queue(hb, &q, to);
/* If we were woken (and unqueued), we succeeded, whatever. */
ret = 0;
if (!futex_unqueue(&q))
goto out;
ret = -ETIMEDOUT;
if (to && !to->task)
goto out;
/*
* We expect signal_pending(current), but we might be the
* victim of a spurious wakeup as well.
*/
if (!signal_pending(current))
goto retry;
ret = -ERESTARTSYS;
if (!abs_time)
goto out;
restart = &current->restart_block;
restart->futex.uaddr = uaddr;
restart->futex.val = val;
restart->futex.time = *abs_time;
restart->futex.bitset = bitset;
restart->futex.flags = flags | FLAGS_HAS_TIMEOUT;
ret = set_restart_fn(restart, futex_wait_restart);
out:
if (to) {
hrtimer_cancel(&to->timer);
destroy_hrtimer_on_stack(&to->timer);
}
return ret;
}
static long futex_wait_restart(struct restart_block *restart)
{
u32 __user *uaddr = restart->futex.uaddr;
ktime_t t, *tp = NULL;
if (restart->futex.flags & FLAGS_HAS_TIMEOUT) {
t = restart->futex.time;
tp = &t;
}
restart->fn = do_no_restart_syscall;
return (long)futex_wait(uaddr, restart->futex.flags,
restart->futex.val, tp, restart->futex.bitset);
}


@ -4671,7 +4671,7 @@ print_lock_invalid_wait_context(struct task_struct *curr,
/*
* Verify the wait_type context.
*
* This check validates we takes locks in the right wait-type order; that is it
* This check validates we take locks in the right wait-type order; that is it
* ensures that we do not take mutexes inside spinlocks and do not attempt to
* acquire spinlocks inside raw_spinlocks and the sort.
*
@ -5366,7 +5366,7 @@ int __lock_is_held(const struct lockdep_map *lock, int read)
struct held_lock *hlock = curr->held_locks + i;
if (match_held_lock(hlock, lock)) {
if (read == -1 || hlock->read == read)
if (read == -1 || !!hlock->read == read)
return LOCK_STATE_HELD;
return LOCK_STATE_NOT_HELD;
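The `!!` added in this hunk matters when the stored `read` field can be larger than 1. A small comparison of the old and new predicates, assuming the lockdep encoding 0 = write, 1 = read, 2 = recursive read for the held lock, and a query value of -1 (any), 0 (write) or 1 (read). Both helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* New check: any non-zero read state counts as "held for read". */
static bool sk_held_as(int hlock_read, int query)
{
	assert(query == -1 || query == 0 || query == 1);
	return query == -1 || (!!hlock_read) == query;
}

/* Old check: compares the raw value, so read state 2 never matches 1. */
static bool sk_held_as_old(int hlock_read, int query)
{
	assert(query == -1 || query == 0 || query == 1);
	return query == -1 || hlock_read == query;
}
```

With a recursive-read hold (`hlock_read == 2`) and a query for "held for read" (`query == 1`), the old predicate reports false and the new one reports true.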


@ -94,6 +94,9 @@ static inline unsigned long __owner_flags(unsigned long owner)
return owner & MUTEX_FLAGS;
}
/*
* Returns: __mutex_owner(lock) on failure or NULL on success.
*/
static inline struct task_struct *__mutex_trylock_common(struct mutex *lock, bool handoff)
{
unsigned long owner, curr = (unsigned long)current;
@ -348,13 +351,16 @@ bool mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner,
{
bool ret = true;
rcu_read_lock();
lockdep_assert_preemption_disabled();
while (__mutex_owner(lock) == owner) {
/*
* Ensure we emit the owner->on_cpu, dereference _after_
* checking lock->owner still matches owner. If that fails,
* owner might point to freed memory. If it still matches,
* the rcu_read_lock() ensures the memory stays valid.
* checking lock->owner still matches owner. And we already
* disabled preemption which is equal to the RCU read-side
* critical section in optimistic spinning code. Thus the
* task_struct structure won't go away during the spinning
* period.
*/
barrier();
@ -374,7 +380,6 @@ bool mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner,
cpu_relax();
}
rcu_read_unlock();
return ret;
}
@ -387,19 +392,25 @@ static inline int mutex_can_spin_on_owner(struct mutex *lock)
struct task_struct *owner;
int retval = 1;
lockdep_assert_preemption_disabled();
if (need_resched())
return 0;
rcu_read_lock();
/*
* We already disabled preemption which is equal to the RCU read-side
* critical section in optimistic spinning code. Thus the task_struct
* structure won't go away during the spinning period.
*/
owner = __mutex_owner(lock);
/*
* Because of the lock holder preemption issue, skip spinning if the task
* is not on a CPU or its vCPU is preempted.
*/
if (owner)
retval = owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
rcu_read_unlock();
/*
* If lock->owner is not set, the mutex has been released. Return true
@ -736,6 +747,44 @@ __ww_mutex_lock(struct mutex *lock, unsigned int state, unsigned int subclass,
return __mutex_lock_common(lock, state, subclass, NULL, ip, ww_ctx, true);
}
/**
* ww_mutex_trylock - tries to acquire the w/w mutex with optional acquire context
* @ww: mutex to lock
* @ww_ctx: optional w/w acquire context
*
* Trylocks a mutex with the optional acquire context; no deadlock detection is
* possible. Returns 1 if the mutex has been acquired successfully, 0 otherwise.
*
* Unlike ww_mutex_lock, no deadlock handling is performed. However, if a @ww_ctx is
* specified, -EALREADY handling may happen in calls to ww_mutex_trylock.
*
* A mutex acquired with this function must be released with ww_mutex_unlock.
*/
int ww_mutex_trylock(struct ww_mutex *ww, struct ww_acquire_ctx *ww_ctx)
{
if (!ww_ctx)
return mutex_trylock(&ww->base);
MUTEX_WARN_ON(ww->base.magic != &ww->base);
/*
* Reset the wounded flag after a kill. No other process can
* race and wound us here, since they can't have a valid owner
* pointer if we don't have any locks held.
*/
if (ww_ctx->acquired == 0)
ww_ctx->wounded = 0;
if (__mutex_trylock(&ww->base)) {
ww_mutex_set_context_fastpath(ww, ww_ctx);
mutex_acquire_nest(&ww->base.dep_map, 0, 1, &ww_ctx->dep_map, _RET_IP_);
return 1;
}
return 0;
}
EXPORT_SYMBOL(ww_mutex_trylock);
#ifdef CONFIG_DEBUG_LOCK_ALLOC
void __sched
mutex_lock_nested(struct mutex *lock, unsigned int subclass)


@ -446,17 +446,24 @@ static __always_inline void rt_mutex_adjust_prio(struct task_struct *p)
}
/* RT mutex specific wake_q wrappers */
static __always_inline void rt_mutex_wake_q_add_task(struct rt_wake_q_head *wqh,
struct task_struct *task,
unsigned int wake_state)
{
if (IS_ENABLED(CONFIG_PREEMPT_RT) && wake_state == TASK_RTLOCK_WAIT) {
if (IS_ENABLED(CONFIG_PROVE_LOCKING))
WARN_ON_ONCE(wqh->rtlock_task);
get_task_struct(task);
wqh->rtlock_task = task;
} else {
wake_q_add(&wqh->head, task);
}
}
static __always_inline void rt_mutex_wake_q_add(struct rt_wake_q_head *wqh,
struct rt_mutex_waiter *w)
{
if (IS_ENABLED(CONFIG_PREEMPT_RT) && w->wake_state != TASK_NORMAL) {
if (IS_ENABLED(CONFIG_PROVE_LOCKING))
WARN_ON_ONCE(wqh->rtlock_task);
get_task_struct(w->task);
wqh->rtlock_task = w->task;
} else {
wake_q_add(&wqh->head, w->task);
}
rt_mutex_wake_q_add_task(wqh, w->task, w->wake_state);
}
static __always_inline void rt_mutex_wake_up_q(struct rt_wake_q_head *wqh)


@ -59,8 +59,7 @@ static __always_inline int rwbase_read_trylock(struct rwbase_rt *rwb)
* set.
*/
for (r = atomic_read(&rwb->readers); r < 0;) {
/* Fully-ordered if cmpxchg() succeeds, provides ACQUIRE */
if (likely(atomic_try_cmpxchg(&rwb->readers, &r, r + 1)))
if (likely(atomic_try_cmpxchg_acquire(&rwb->readers, &r, r + 1)))
return 1;
}
return 0;
@ -148,6 +147,7 @@ static void __sched __rwbase_read_unlock(struct rwbase_rt *rwb,
{
struct rt_mutex_base *rtm = &rwb->rtmutex;
struct task_struct *owner;
DEFINE_RT_WAKE_Q(wqh);
raw_spin_lock_irq(&rtm->wait_lock);
/*
@ -158,9 +158,12 @@ static void __sched __rwbase_read_unlock(struct rwbase_rt *rwb,
*/
owner = rt_mutex_owner(rtm);
if (owner)
wake_up_state(owner, state);
rt_mutex_wake_q_add_task(&wqh, owner, state);
/* Pairs with the preempt_enable in rt_mutex_wake_up_q() */
preempt_disable();
raw_spin_unlock_irq(&rtm->wait_lock);
rt_mutex_wake_up_q(&wqh);
}
static __always_inline void rwbase_read_unlock(struct rwbase_rt *rwb,
@ -183,7 +186,7 @@ static inline void __rwbase_write_unlock(struct rwbase_rt *rwb, int bias,
/*
* _release() is needed in case that reader is in fast path, pairing
* with atomic_try_cmpxchg() in rwbase_read_trylock(), provides RELEASE
* with atomic_try_cmpxchg_acquire() in rwbase_read_trylock().
*/
(void)atomic_add_return_release(READER_BIAS - bias, &rwb->readers);
raw_spin_unlock_irqrestore(&rtm->wait_lock, flags);


@ -56,7 +56,6 @@
*
* A fast path reader optimistic lock stealing is supported when the rwsem
* is previously owned by a writer and the following conditions are met:
* - OSQ is empty
* - rwsem is not currently writer owned
* - the handoff isn't set.
*/
@ -485,7 +484,7 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
/*
* Limit # of readers that can be woken up per wakeup call.
*/
if (woken >= MAX_READERS_WAKEUP)
if (unlikely(woken >= MAX_READERS_WAKEUP))
break;
}
@ -577,6 +576,24 @@ static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
return true;
}
/*
* The rwsem_spin_on_owner() function returns the following 4 values
* depending on the lock owner state.
* OWNER_NULL : owner is currently NULL
* OWNER_WRITER: when owner changes and is a writer
* OWNER_READER: when owner changes and the new owner may be a reader.
* OWNER_NONSPINNABLE:
* when optimistic spinning has to stop because either the
* owner stops running, is unknown, or its timeslice has
* been used up.
*/
enum owner_state {
OWNER_NULL = 1 << 0,
OWNER_WRITER = 1 << 1,
OWNER_READER = 1 << 2,
OWNER_NONSPINNABLE = 1 << 3,
};
#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
/*
* Try to acquire write lock before the writer has been put on wait queue.
@ -617,7 +634,10 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
}
preempt_disable();
rcu_read_lock();
/*
* Disabling preemption is equivalent to an RCU read-side critical section,
* thus the task_struct structure won't go away.
*/
owner = rwsem_owner_flags(sem, &flags);
/*
* Don't check the read-owner as the entry may be stale.
@ -625,30 +645,12 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
if ((flags & RWSEM_NONSPINNABLE) ||
(owner && !(flags & RWSEM_READER_OWNED) && !owner_on_cpu(owner)))
ret = false;
rcu_read_unlock();
preempt_enable();
lockevent_cond_inc(rwsem_opt_fail, !ret);
return ret;
}
/*
* The rwsem_spin_on_owner() function returns the following 4 values
* depending on the lock owner state.
* OWNER_NULL : owner is currently NULL
* OWNER_WRITER: when owner changes and is a writer
* OWNER_READER: when owner changes and the new owner may be a reader.
* OWNER_NONSPINNABLE:
* when optimistic spinning has to stop because either the
* owner stops running, is unknown, or its timeslice has
* been used up.
*/
enum owner_state {
OWNER_NULL = 1 << 0,
OWNER_WRITER = 1 << 1,
OWNER_READER = 1 << 2,
OWNER_NONSPINNABLE = 1 << 3,
};
#define OWNER_SPINNABLE (OWNER_NULL | OWNER_WRITER | OWNER_READER)
static inline enum owner_state
@ -670,12 +672,13 @@ rwsem_spin_on_owner(struct rw_semaphore *sem)
unsigned long flags, new_flags;
enum owner_state state;
lockdep_assert_preemption_disabled();
owner = rwsem_owner_flags(sem, &flags);
state = rwsem_owner_state(owner, flags);
if (state != OWNER_WRITER)
return state;
rcu_read_lock();
for (;;) {
/*
* When a waiting writer set the handoff flag, it may spin
@ -693,7 +696,9 @@ rwsem_spin_on_owner(struct rw_semaphore *sem)
* Ensure we emit the owner->on_cpu, dereference _after_
* checking sem->owner still matches owner, if that fails,
* owner might point to free()d memory, if it still matches,
* the rcu_read_lock() ensures the memory stays valid.
* our spinning context has already disabled preemption, which is
* equivalent to an RCU read-side critical section and ensures the
* memory stays valid.
*/
barrier();
@ -704,7 +709,6 @@ rwsem_spin_on_owner(struct rw_semaphore *sem)
cpu_relax();
}
rcu_read_unlock();
return state;
}
@ -878,12 +882,11 @@ static inline bool rwsem_optimistic_spin(struct rw_semaphore *sem)
static inline void clear_nonspinnable(struct rw_semaphore *sem) { }
static inline int
static inline enum owner_state
rwsem_spin_on_owner(struct rw_semaphore *sem)
{
return 0;
return OWNER_NONSPINNABLE;
}
#define OWNER_NULL 1
#endif
/*
@ -1095,9 +1098,16 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
* In this case, we attempt to acquire the lock again
* without sleeping.
*/
if (wstate == WRITER_HANDOFF &&
rwsem_spin_on_owner(sem) == OWNER_NULL)
goto trylock_again;
if (wstate == WRITER_HANDOFF) {
enum owner_state owner_state;
preempt_disable();
owner_state = rwsem_spin_on_owner(sem);
preempt_enable();
if (owner_state == OWNER_NULL)
goto trylock_again;
}
/* Block until there are no active lockers. */
for (;;) {


@ -378,8 +378,7 @@ unsigned long __lockfunc _raw_spin_lock_irqsave_nested(raw_spinlock_t *lock,
local_irq_save(flags);
preempt_disable();
spin_acquire(&lock->dep_map, subclass, 0, _RET_IP_);
LOCK_CONTENDED_FLAGS(lock, do_raw_spin_trylock, do_raw_spin_lock,
do_raw_spin_lock_flags, &flags);
LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
return flags;
}
EXPORT_SYMBOL(_raw_spin_lock_irqsave_nested);


@ -24,6 +24,17 @@
#define RT_MUTEX_BUILD_SPINLOCKS
#include "rtmutex.c"
/*
* __might_resched() skips the state check as rtlocks are state
* preserving. Take RCU nesting into account as spin/read/write_lock() can
* legitimately nest into an RCU read side critical section.
*/
#define RTLOCK_RESCHED_OFFSETS \
(rcu_preempt_depth() << MIGHT_RESCHED_RCU_SHIFT)
#define rtlock_might_resched() \
__might_resched(__FILE__, __LINE__, RTLOCK_RESCHED_OFFSETS)
static __always_inline void rtlock_lock(struct rt_mutex_base *rtm)
{
if (unlikely(!rt_mutex_cmpxchg_acquire(rtm, NULL, current)))
@ -32,7 +43,7 @@ static __always_inline void rtlock_lock(struct rt_mutex_base *rtm)
static __always_inline void __rt_spin_lock(spinlock_t *lock)
{
___might_sleep(__FILE__, __LINE__, 0);
rtlock_might_resched();
rtlock_lock(&lock->lock);
rcu_read_lock();
migrate_disable();
@ -210,7 +221,7 @@ EXPORT_SYMBOL(rt_write_trylock);
void __sched rt_read_lock(rwlock_t *rwlock)
{
___might_sleep(__FILE__, __LINE__, 0);
rtlock_might_resched();
rwlock_acquire_read(&rwlock->dep_map, 0, 0, _RET_IP_);
rwbase_read_lock(&rwlock->rwbase, TASK_RTLOCK_WAIT);
rcu_read_lock();
@ -220,7 +231,7 @@ EXPORT_SYMBOL(rt_read_lock);
void __sched rt_write_lock(rwlock_t *rwlock)
{
___might_sleep(__FILE__, __LINE__, 0);
rtlock_might_resched();
rwlock_acquire(&rwlock->dep_map, 0, 0, _RET_IP_);
rwbase_write_lock(&rwlock->rwbase, TASK_RTLOCK_WAIT);
rcu_read_lock();
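
The RTLOCK_RESCHED_OFFSETS macro above packs the expected RCU nesting depth into the upper bits of a single offsets word, next to the expected preempt count. A rough userspace sketch of that encoding follows. The shift value of 8 is an assumption (the diff does not show the definition of MIGHT_RESCHED_RCU_SHIFT), and all `sketch_` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed layout: low 8 bits preempt count, remaining bits RCU depth. */
#define SKETCH_RCU_SHIFT	8U
#define SKETCH_PREEMPT_MASK	((1U << SKETCH_RCU_SHIFT) - 1)

/* Build the "expected" word a caller would hand to the check. */
unsigned int sketch_resched_offsets(unsigned int preempt, unsigned int rcu_depth)
{
	assert(preempt <= SKETCH_PREEMPT_MASK);
	return preempt | (rcu_depth << SKETCH_RCU_SHIFT);
}

/* Combine the current state the same way and compare with the expectation. */
bool sketch_offsets_ok(unsigned int preempt_count, unsigned int rcu_depth,
		       unsigned int offsets)
{
	unsigned int nested = preempt_count;

	nested += rcu_depth << SKETCH_RCU_SHIFT;
	return nested == offsets;
}
```

With this layout, a lock taken inside one RCU read-side critical section expects `sketch_resched_offsets(0, 1)`, so the check passes at RCU depth 1 with preemption enabled and fails once preemption is disabled.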


@ -16,6 +16,15 @@
static DEFINE_WD_CLASS(ww_class);
struct workqueue_struct *wq;
#ifdef CONFIG_DEBUG_WW_MUTEX_SLOWPATH
#define ww_acquire_init_noinject(a, b) do { \
ww_acquire_init((a), (b)); \
(a)->deadlock_inject_countdown = ~0U; \
} while (0)
#else
#define ww_acquire_init_noinject(a, b) ww_acquire_init((a), (b))
#endif
struct test_mutex {
struct work_struct work;
struct ww_mutex mutex;
@ -36,7 +45,7 @@ static void test_mutex_work(struct work_struct *work)
wait_for_completion(&mtx->go);
if (mtx->flags & TEST_MTX_TRY) {
while (!ww_mutex_trylock(&mtx->mutex))
while (!ww_mutex_trylock(&mtx->mutex, NULL))
cond_resched();
} else {
ww_mutex_lock(&mtx->mutex, NULL);
@ -109,19 +118,39 @@ static int test_mutex(void)
return 0;
}
static int test_aa(void)
static int test_aa(bool trylock)
{
struct ww_mutex mutex;
struct ww_acquire_ctx ctx;
int ret;
const char *from = trylock ? "trylock" : "lock";
ww_mutex_init(&mutex, &ww_class);
ww_acquire_init(&ctx, &ww_class);
ww_mutex_lock(&mutex, &ctx);
if (!trylock) {
ret = ww_mutex_lock(&mutex, &ctx);
if (ret) {
pr_err("%s: initial lock failed!\n", __func__);
goto out;
}
} else {
ret = !ww_mutex_trylock(&mutex, &ctx);
if (ret) {
pr_err("%s: initial trylock failed!\n", __func__);
goto out;
}
}
if (ww_mutex_trylock(&mutex)) {
pr_err("%s: trylocked itself!\n", __func__);
if (ww_mutex_trylock(&mutex, NULL)) {
pr_err("%s: trylocked itself without context from %s!\n", __func__, from);
ww_mutex_unlock(&mutex);
ret = -EINVAL;
goto out;
}
if (ww_mutex_trylock(&mutex, &ctx)) {
pr_err("%s: trylocked itself with context from %s!\n", __func__, from);
ww_mutex_unlock(&mutex);
ret = -EINVAL;
goto out;
@ -129,17 +158,17 @@ static int test_aa(void)
ret = ww_mutex_lock(&mutex, &ctx);
if (ret != -EALREADY) {
pr_err("%s: missed deadlock for recursing, ret=%d\n",
__func__, ret);
pr_err("%s: missed deadlock for recursing, ret=%d from %s\n",
__func__, ret, from);
if (!ret)
ww_mutex_unlock(&mutex);
ret = -EINVAL;
goto out;
}
ww_mutex_unlock(&mutex);
ret = 0;
out:
ww_mutex_unlock(&mutex);
ww_acquire_fini(&ctx);
return ret;
}
@ -150,7 +179,7 @@ struct test_abba {
struct ww_mutex b_mutex;
struct completion a_ready;
struct completion b_ready;
bool resolve;
bool resolve, trylock;
int result;
};
@ -160,8 +189,13 @@ static void test_abba_work(struct work_struct *work)
struct ww_acquire_ctx ctx;
int err;
ww_acquire_init(&ctx, &ww_class);
ww_mutex_lock(&abba->b_mutex, &ctx);
ww_acquire_init_noinject(&ctx, &ww_class);
if (!abba->trylock)
ww_mutex_lock(&abba->b_mutex, &ctx);
else
WARN_ON(!ww_mutex_trylock(&abba->b_mutex, &ctx));
WARN_ON(READ_ONCE(abba->b_mutex.ctx) != &ctx);
complete(&abba->b_ready);
wait_for_completion(&abba->a_ready);
@ -181,7 +215,7 @@ static void test_abba_work(struct work_struct *work)
abba->result = err;
}
static int test_abba(bool resolve)
static int test_abba(bool trylock, bool resolve)
{
struct test_abba abba;
struct ww_acquire_ctx ctx;
@ -192,12 +226,18 @@ static int test_abba(bool resolve)
INIT_WORK_ONSTACK(&abba.work, test_abba_work);
init_completion(&abba.a_ready);
init_completion(&abba.b_ready);
abba.trylock = trylock;
abba.resolve = resolve;
schedule_work(&abba.work);
ww_acquire_init(&ctx, &ww_class);
ww_mutex_lock(&abba.a_mutex, &ctx);
ww_acquire_init_noinject(&ctx, &ww_class);
if (!trylock)
ww_mutex_lock(&abba.a_mutex, &ctx);
else
WARN_ON(!ww_mutex_trylock(&abba.a_mutex, &ctx));
WARN_ON(READ_ONCE(abba.a_mutex.ctx) != &ctx);
complete(&abba.a_ready);
wait_for_completion(&abba.b_ready);
@ -249,7 +289,7 @@ static void test_cycle_work(struct work_struct *work)
struct ww_acquire_ctx ctx;
int err, erra = 0;
ww_acquire_init(&ctx, &ww_class);
ww_acquire_init_noinject(&ctx, &ww_class);
ww_mutex_lock(&cycle->a_mutex, &ctx);
complete(cycle->a_signal);
@ -581,7 +621,9 @@ static int stress(int nlocks, int nthreads, unsigned int flags)
static int __init test_ww_mutex_init(void)
{
int ncpus = num_online_cpus();
int ret;
int ret, i;
printk(KERN_INFO "Beginning ww mutex selftests\n");
wq = alloc_workqueue("test-ww_mutex", WQ_UNBOUND, 0);
if (!wq)
@ -591,17 +633,19 @@ static int __init test_ww_mutex_init(void)
if (ret)
return ret;
ret = test_aa();
ret = test_aa(false);
if (ret)
return ret;
ret = test_abba(false);
ret = test_aa(true);
if (ret)
return ret;
ret = test_abba(true);
if (ret)
return ret;
for (i = 0; i < 4; i++) {
ret = test_abba(i & 1, i & 2);
if (ret)
return ret;
}
ret = test_cycle(ncpus);
if (ret)
@ -619,6 +663,7 @@ static int __init test_ww_mutex_init(void)
if (ret)
return ret;
printk(KERN_INFO "All ww mutex selftests passed\n");
return 0;
}
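
The new loop in test_ww_mutex_init() drives test_abba() through every combination of its two boolean arguments by decoding the bits of the loop counter. A small sketch of that decoding, with the hypothetical names `sketch_abba_case` and `sketch_enumerate_abba()`:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_abba_case {
	bool trylock;
	bool resolve;
};

/* Fill out[0..3] the same way the selftest loop derives its arguments. */
void sketch_enumerate_abba(struct sketch_abba_case out[4])
{
	assert(out != NULL);
	for (int i = 0; i < 4; i++) {
		out[i].trylock = i & 1;	/* bit 0 */
		out[i].resolve = i & 2;	/* bit 1; any nonzero value becomes true */
	}
}
```

The resulting order is (lock, no resolve), (trylock, no resolve), (lock, resolve), (trylock, resolve).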


@ -9,6 +9,31 @@
#define WW_RT
#include "rtmutex.c"
int ww_mutex_trylock(struct ww_mutex *lock, struct ww_acquire_ctx *ww_ctx)
{
struct rt_mutex *rtm = &lock->base;
if (!ww_ctx)
return rt_mutex_trylock(rtm);
/*
* Reset the wounded flag after a kill. No other process can
* race and wound us here, since they can't have a valid owner
* pointer if we don't have any locks held.
*/
if (ww_ctx->acquired == 0)
ww_ctx->wounded = 0;
if (__rt_mutex_trylock(&rtm->rtmutex)) {
ww_mutex_set_context_fastpath(lock, ww_ctx);
mutex_acquire_nest(&rtm->dep_map, 0, 1, ww_ctx->dep_map, _RET_IP_);
return 1;
}
return 0;
}
EXPORT_SYMBOL(ww_mutex_trylock);
static int __sched
__ww_rt_mutex_lock(struct ww_mutex *lock, struct ww_acquire_ctx *ww_ctx,
unsigned int state, unsigned long ip)


@ -247,7 +247,7 @@ struct lockdep_map rcu_lock_map = {
.name = "rcu_read_lock",
.key = &rcu_lock_key,
.wait_type_outer = LD_WAIT_FREE,
.wait_type_inner = LD_WAIT_CONFIG, /* XXX PREEMPT_RCU ? */
.wait_type_inner = LD_WAIT_CONFIG, /* PREEMPT_RT implies PREEMPT_RCU */
};
EXPORT_SYMBOL_GPL(rcu_lock_map);
@ -256,7 +256,7 @@ struct lockdep_map rcu_bh_lock_map = {
.name = "rcu_read_lock_bh",
.key = &rcu_bh_lock_key,
.wait_type_outer = LD_WAIT_FREE,
.wait_type_inner = LD_WAIT_CONFIG, /* PREEMPT_LOCK also makes BH preemptible */
.wait_type_inner = LD_WAIT_CONFIG, /* PREEMPT_RT makes BH preemptible. */
};
EXPORT_SYMBOL_GPL(rcu_bh_lock_map);


@ -9470,14 +9470,8 @@ void __init sched_init(void)
}
#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
static inline int preempt_count_equals(int preempt_offset)
{
int nested = preempt_count() + rcu_preempt_depth();
return (nested == preempt_offset);
}
void __might_sleep(const char *file, int line, int preempt_offset)
void __might_sleep(const char *file, int line)
{
unsigned int state = get_current_state();
/*
@ -9491,11 +9485,32 @@ void __might_sleep(const char *file, int line, int preempt_offset)
(void *)current->task_state_change,
(void *)current->task_state_change);
___might_sleep(file, line, preempt_offset);
__might_resched(file, line, 0);
}
EXPORT_SYMBOL(__might_sleep);
void ___might_sleep(const char *file, int line, int preempt_offset)
static void print_preempt_disable_ip(int preempt_offset, unsigned long ip)
{
if (!IS_ENABLED(CONFIG_DEBUG_PREEMPT))
return;
if (preempt_count() == preempt_offset)
return;
pr_err("Preemption disabled at:");
print_ip_sym(KERN_ERR, ip);
}
static inline bool resched_offsets_ok(unsigned int offsets)
{
unsigned int nested = preempt_count();
nested += rcu_preempt_depth() << MIGHT_RESCHED_RCU_SHIFT;
return nested == offsets;
}
void __might_resched(const char *file, int line, unsigned int offsets)
{
/* Ratelimiting timestamp: */
static unsigned long prev_jiffy;
@ -9505,7 +9520,7 @@ void ___might_sleep(const char *file, int line, int preempt_offset)
/* WARN_ON_ONCE() by default, no rate limit required: */
rcu_sleep_check();
if ((preempt_count_equals(preempt_offset) && !irqs_disabled() &&
if ((resched_offsets_ok(offsets) && !irqs_disabled() &&
!is_idle_task(current) && !current->non_block_count) ||
system_state == SYSTEM_BOOTING || system_state > SYSTEM_RUNNING ||
oops_in_progress)
@ -9518,29 +9533,33 @@ void ___might_sleep(const char *file, int line, int preempt_offset)
/* Save this before calling printk(), since that will clobber it: */
preempt_disable_ip = get_preempt_disable_ip(current);
printk(KERN_ERR
"BUG: sleeping function called from invalid context at %s:%d\n",
file, line);
printk(KERN_ERR
"in_atomic(): %d, irqs_disabled(): %d, non_block: %d, pid: %d, name: %s\n",
in_atomic(), irqs_disabled(), current->non_block_count,
current->pid, current->comm);
pr_err("BUG: sleeping function called from invalid context at %s:%d\n",
file, line);
pr_err("in_atomic(): %d, irqs_disabled(): %d, non_block: %d, pid: %d, name: %s\n",
in_atomic(), irqs_disabled(), current->non_block_count,
current->pid, current->comm);
pr_err("preempt_count: %x, expected: %x\n", preempt_count(),
offsets & MIGHT_RESCHED_PREEMPT_MASK);
if (IS_ENABLED(CONFIG_PREEMPT_RCU)) {
pr_err("RCU nest depth: %d, expected: %u\n",
rcu_preempt_depth(), offsets >> MIGHT_RESCHED_RCU_SHIFT);
}
if (task_stack_end_corrupted(current))
printk(KERN_EMERG "Thread overran stack, or stack corrupted\n");
pr_emerg("Thread overran stack, or stack corrupted\n");
debug_show_held_locks(current);
if (irqs_disabled())
print_irqtrace_events(current);
if (IS_ENABLED(CONFIG_DEBUG_PREEMPT)
&& !preempt_count_equals(preempt_offset)) {
pr_err("Preemption disabled at:");
print_ip_sym(KERN_ERR, preempt_disable_ip);
}
print_preempt_disable_ip(offsets & MIGHT_RESCHED_PREEMPT_MASK,
preempt_disable_ip);
dump_stack();
add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
}
EXPORT_SYMBOL(___might_sleep);
EXPORT_SYMBOL(__might_resched);
void __cant_sleep(const char *file, int line, int preempt_offset)
{


@ -143,13 +143,14 @@ COND_SYSCALL(capset);
/* __ARCH_WANT_SYS_CLONE3 */
COND_SYSCALL(clone3);
/* kernel/futex.c */
/* kernel/futex/syscalls.c */
COND_SYSCALL(futex);
COND_SYSCALL(futex_time32);
COND_SYSCALL(set_robust_list);
COND_SYSCALL_COMPAT(set_robust_list);
COND_SYSCALL(get_robust_list);
COND_SYSCALL_COMPAT(get_robust_list);
COND_SYSCALL(futex_waitv);
/* kernel/hrtimer.c */


@ -258,7 +258,7 @@ static void init_shared_classes(void)
#define WWAF(x) ww_acquire_fini(x)
#define WWL(x, c) ww_mutex_lock(x, c)
#define WWT(x) ww_mutex_trylock(x)
#define WWT(x) ww_mutex_trylock(x, NULL)
#define WWL1(x) ww_mutex_lock(x, NULL)
#define WWU(x) ww_mutex_unlock(x)


@ -5267,7 +5267,7 @@ void __might_fault(const char *file, int line)
return;
if (pagefault_disabled())
return;
__might_sleep(file, line, 0);
__might_sleep(file, line);
#if defined(CONFIG_DEBUG_ATOMIC_SLEEP)
if (current->mm)
might_lock_read(&current->mm->mmap_lock);


@ -8,3 +8,4 @@ futex_wait_uninitialized_heap
futex_wait_wouldblock
futex_wait
futex_requeue
futex_waitv


@ -17,7 +17,8 @@ TEST_GEN_FILES := \
futex_wait_uninitialized_heap \
futex_wait_private_mapped_file \
futex_wait \
futex_requeue
futex_requeue \
futex_waitv
TEST_PROGS := run.sh


@ -17,6 +17,7 @@
#include <pthread.h>
#include "futextest.h"
#include "futex2test.h"
#include "logging.h"
#define TEST_NAME "futex-wait-timeout"
@ -96,6 +97,12 @@ int main(int argc, char *argv[])
struct timespec to;
pthread_t thread;
int c;
struct futex_waitv waitv = {
.uaddr = (uintptr_t)&f1,
.val = f1,
.flags = FUTEX_32,
.__reserved = 0
};
while ((c = getopt(argc, argv, "cht:v:")) != -1) {
switch (c) {
@ -118,7 +125,7 @@ int main(int argc, char *argv[])
}
ksft_print_header();
ksft_set_plan(7);
ksft_set_plan(9);
ksft_print_msg("%s: Block on a futex and wait for timeout\n",
basename(argv[0]));
ksft_print_msg("\tArguments: timeout=%ldns\n", timeout_ns);
@ -175,6 +182,18 @@ int main(int argc, char *argv[])
res = futex_lock_pi(&futex_pi, NULL, 0, FUTEX_CLOCK_REALTIME);
test_timeout(res, &ret, "futex_lock_pi invalid timeout flag", ENOSYS);
/* futex_waitv with CLOCK_MONOTONIC */
if (futex_get_abs_timeout(CLOCK_MONOTONIC, &to, timeout_ns))
return RET_FAIL;
res = futex_waitv(&waitv, 1, 0, &to, CLOCK_MONOTONIC);
test_timeout(res, &ret, "futex_waitv monotonic", ETIMEDOUT);
/* futex_waitv with CLOCK_REALTIME */
if (futex_get_abs_timeout(CLOCK_REALTIME, &to, timeout_ns))
return RET_FAIL;
res = futex_waitv(&waitv, 1, 0, &to, CLOCK_REALTIME);
test_timeout(res, &ret, "futex_waitv realtime", ETIMEDOUT);
ksft_print_cnts();
return ret;
}


@ -22,6 +22,7 @@
#include <string.h>
#include <time.h>
#include "futextest.h"
#include "futex2test.h"
#include "logging.h"
#define TEST_NAME "futex-wait-wouldblock"
@ -42,6 +43,12 @@ int main(int argc, char *argv[])
futex_t f1 = FUTEX_INITIALIZER;
int res, ret = RET_PASS;
int c;
struct futex_waitv waitv = {
.uaddr = (uintptr_t)&f1,
.val = f1+1,
.flags = FUTEX_32,
.__reserved = 0
};
while ((c = getopt(argc, argv, "cht:v:")) != -1) {
switch (c) {
@ -61,18 +68,44 @@ int main(int argc, char *argv[])
}
ksft_print_header();
ksft_set_plan(1);
ksft_set_plan(2);
ksft_print_msg("%s: Test the unexpected futex value in FUTEX_WAIT\n",
basename(argv[0]));
info("Calling futex_wait on f1: %u @ %p with val=%u\n", f1, &f1, f1+1);
res = futex_wait(&f1, f1+1, &to, FUTEX_PRIVATE_FLAG);
if (!res || errno != EWOULDBLOCK) {
fail("futex_wait returned: %d %s\n",
res ? errno : res, res ? strerror(errno) : "");
ksft_test_result_fail("futex_wait returned: %d %s\n",
res ? errno : res,
res ? strerror(errno) : "");
ret = RET_FAIL;
} else {
ksft_test_result_pass("futex_wait\n");
}
print_result(TEST_NAME, ret);
if (clock_gettime(CLOCK_MONOTONIC, &to)) {
error("clock_gettime failed\n", errno);
return errno;
}
to.tv_nsec += timeout_ns;
if (to.tv_nsec >= 1000000000) {
to.tv_sec++;
to.tv_nsec -= 1000000000;
}
info("Calling futex_waitv on f1: %u @ %p with val=%u\n", f1, &f1, f1+1);
res = futex_waitv(&waitv, 1, 0, &to, CLOCK_MONOTONIC);
if (!res || errno != EWOULDBLOCK) {
ksft_test_result_fail("futex_waitv returned: %d %s\n",
res ? errno : res,
res ? strerror(errno) : "");
ret = RET_FAIL;
} else {
ksft_test_result_pass("futex_waitv\n");
}
ksft_print_cnts();
return ret;
}
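
The wouldblock test above builds an absolute timeout by adding timeout_ns to the current time and carrying any overflow of tv_nsec into tv_sec. A slightly more general sketch of that normalization follows. The helper name `sketch_timespec_add_ns()` is hypothetical, and unlike the test code it also accepts offsets of one second or more.

```c
#include <assert.h>
#include <time.h>

#define SKETCH_NSEC_PER_SEC 1000000000LL

/*
 * Add a non-negative nanosecond offset to *ts, keeping tv_nsec in
 * [0, 1e9). Assumes ts->tv_nsec is already normalized on entry.
 */
void sketch_timespec_add_ns(struct timespec *ts, long long ns)
{
	assert(ns >= 0);
	ts->tv_sec += (time_t)(ns / SKETCH_NSEC_PER_SEC);
	ts->tv_nsec += (long)(ns % SKETCH_NSEC_PER_SEC);
	if (ts->tv_nsec >= SKETCH_NSEC_PER_SEC) {
		ts->tv_sec++;
		ts->tv_nsec -= SKETCH_NSEC_PER_SEC;
	}
}
```

Splitting the offset into whole seconds and a remainder first means at most one carry is needed afterwards, which is the single `if` the test code also uses.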


@ -0,0 +1,237 @@
// SPDX-License-Identifier: GPL-2.0-or-later
/*
* futex_waitv() test by André Almeida <andrealmeid@collabora.com>
*
* Copyright 2021 Collabora Ltd.
*/
#include <errno.h>
#include <error.h>
#include <getopt.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <pthread.h>
#include <stdint.h>
#include <sys/shm.h>
#include "futextest.h"
#include "futex2test.h"
#include "logging.h"
#define TEST_NAME "futex-wait"
#define WAKE_WAIT_US 10000
#define NR_FUTEXES 30
static struct futex_waitv waitv[NR_FUTEXES];
u_int32_t futexes[NR_FUTEXES] = {0};
void usage(char *prog)
{
printf("Usage: %s\n", prog);
printf(" -c Use color\n");
printf(" -h Display this help message\n");
printf(" -v L Verbosity level: %d=QUIET %d=CRITICAL %d=INFO\n",
VQUIET, VCRITICAL, VINFO);
}
void *waiterfn(void *arg)
{
struct timespec to;
int res;
/* setting absolute timeout for futex2 */
if (clock_gettime(CLOCK_MONOTONIC, &to))
error("gettime64 failed\n", errno);
to.tv_sec++;
res = futex_waitv(waitv, NR_FUTEXES, 0, &to, CLOCK_MONOTONIC);
if (res < 0) {
ksft_test_result_fail("futex_waitv returned: %d %s\n",
errno, strerror(errno));
} else if (res != NR_FUTEXES - 1) {
ksft_test_result_fail("futex_waitv returned: %d, expecting %d\n",
res, NR_FUTEXES - 1);
}
return NULL;
}
int main(int argc, char *argv[])
{
pthread_t waiter;
int res, ret = RET_PASS;
struct timespec to;
int c, i;
while ((c = getopt(argc, argv, "cht:v:")) != -1) {
switch (c) {
case 'c':
log_color(1);
break;
case 'h':
usage(basename(argv[0]));
exit(0);
case 'v':
log_verbosity(atoi(optarg));
break;
default:
usage(basename(argv[0]));
exit(1);
}
}
ksft_print_header();
ksft_set_plan(7);
ksft_print_msg("%s: Test FUTEX_WAITV\n",
basename(argv[0]));
for (i = 0; i < NR_FUTEXES; i++) {
waitv[i].uaddr = (uintptr_t)&futexes[i];
waitv[i].flags = FUTEX_32 | FUTEX_PRIVATE_FLAG;
waitv[i].val = 0;
waitv[i].__reserved = 0;
}
/* Private waitv */
if (pthread_create(&waiter, NULL, waiterfn, NULL))
error("pthread_create failed\n", errno);
usleep(WAKE_WAIT_US);
res = futex_wake(u64_to_ptr(waitv[NR_FUTEXES - 1].uaddr), 1, FUTEX_PRIVATE_FLAG);
if (res != 1) {
ksft_test_result_fail("futex_wake private returned: %d %s\n",
res ? errno : res,
res ? strerror(errno) : "");
ret = RET_FAIL;
} else {
ksft_test_result_pass("futex_waitv private\n");
}
/* Shared waitv */
for (i = 0; i < NR_FUTEXES; i++) {
int shm_id = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0666);
if (shm_id < 0) {
perror("shmget");
exit(1);
}
unsigned int *shared_data = shmat(shm_id, NULL, 0);
*shared_data = 0;
waitv[i].uaddr = (uintptr_t)shared_data;
waitv[i].flags = FUTEX_32;
waitv[i].val = 0;
waitv[i].__reserved = 0;
}
if (pthread_create(&waiter, NULL, waiterfn, NULL))
error("pthread_create failed\n", errno);
usleep(WAKE_WAIT_US);
res = futex_wake(u64_to_ptr(waitv[NR_FUTEXES - 1].uaddr), 1, 0);
if (res != 1) {
ksft_test_result_fail("futex_wake shared returned: %d %s\n",
res ? errno : res,
res ? strerror(errno) : "");
ret = RET_FAIL;
} else {
ksft_test_result_pass("futex_waitv shared\n");
}
for (i = 0; i < NR_FUTEXES; i++)
shmdt(u64_to_ptr(waitv[i].uaddr));
/* Testing a waiter without FUTEX_32 flag */
waitv[0].flags = FUTEX_PRIVATE_FLAG;
if (clock_gettime(CLOCK_MONOTONIC, &to))
error("gettime64 failed\n", errno);
to.tv_sec++;
res = futex_waitv(waitv, NR_FUTEXES, 0, &to, CLOCK_MONOTONIC);
if (res == EINVAL) {
ksft_test_result_fail("futex_waitv private returned: %d %s\n",
res ? errno : res,
res ? strerror(errno) : "");
ret = RET_FAIL;
} else {
ksft_test_result_pass("futex_waitv without FUTEX_32\n");
}
/* Testing a waiter with an unaligned address */
waitv[0].flags = FUTEX_PRIVATE_FLAG | FUTEX_32;
waitv[0].uaddr = 1;
if (clock_gettime(CLOCK_MONOTONIC, &to))
error("gettime64 failed\n", errno);
to.tv_sec++;
res = futex_waitv(waitv, NR_FUTEXES, 0, &to, CLOCK_MONOTONIC);
if (res == EINVAL) {
ksft_test_result_fail("futex_waitv private returned: %d %s\n",
res ? errno : res,
res ? strerror(errno) : "");
ret = RET_FAIL;
} else {
ksft_test_result_pass("futex_waitv with an unaligned address\n");
}
/* Testing a NULL address for waiters.uaddr */
waitv[0].uaddr = 0x00000000;
if (clock_gettime(CLOCK_MONOTONIC, &to))
error("gettime64 failed\n", errno);
to.tv_sec++;
res = futex_waitv(waitv, NR_FUTEXES, 0, &to, CLOCK_MONOTONIC);
if (res == EINVAL) {
ksft_test_result_fail("futex_waitv private returned: %d %s\n",
res ? errno : res,
res ? strerror(errno) : "");
ret = RET_FAIL;
} else {
ksft_test_result_pass("futex_waitv NULL address in waitv.uaddr\n");
}
/* Testing a NULL address for *waiters */
if (clock_gettime(CLOCK_MONOTONIC, &to))
error("gettime64 failed\n", errno);
to.tv_sec++;
res = futex_waitv(NULL, NR_FUTEXES, 0, &to, CLOCK_MONOTONIC);
if (res == EINVAL) {
ksft_test_result_fail("futex_waitv private returned: %d %s\n",
res ? errno : res,
res ? strerror(errno) : "");
ret = RET_FAIL;
} else {
ksft_test_result_pass("futex_waitv NULL address in *waiters\n");
}
/* Testing an invalid clockid */
if (clock_gettime(CLOCK_MONOTONIC, &to))
error("gettime64 failed\n", errno);
to.tv_sec++;
res = futex_waitv(NULL, NR_FUTEXES, 0, &to, CLOCK_TAI);
if (res == EINVAL) {
ksft_test_result_fail("futex_waitv private returned: %d %s\n",
res ? errno : res,
res ? strerror(errno) : "");
ret = RET_FAIL;
} else {
ksft_test_result_pass("futex_waitv invalid clockid\n");
}
ksft_print_cnts();
return ret;
}


@ -79,3 +79,6 @@ echo
echo
./futex_requeue $COLOR
echo
./futex_waitv $COLOR


@ -0,0 +1,22 @@
/* SPDX-License-Identifier: GPL-2.0-or-later */
/*
* Futex2 library addons for futex tests
*
* Copyright 2021 Collabora Ltd.
*/
#include <stdint.h>
#define u64_to_ptr(x) ((void *)(uintptr_t)(x))
/**
* futex_waitv - Wait on multiple futexes, wake on any
* @waiters: Array of waiters
* @nr_waiters: Length of waiters array
* @flags: Operation flags
* @timo: Optional timeout for operation
* @clockid: Clock that the absolute timeout @timo is measured against
*/
static inline int futex_waitv(volatile struct futex_waitv *waiters, unsigned long nr_waiters,
unsigned long flags, struct timespec *timo, clockid_t clockid)
{
return syscall(__NR_futex_waitv, waiters, nr_waiters, flags, timo, clockid);
}
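
As a rough illustration of how the wrapper above behaves when the expected value does not match, the sketch below issues the raw syscall with a deliberately wrong `val`, so the call returns immediately instead of blocking. The struct layout and the flag value 2 are assumed to mirror `struct futex_waitv` and `FUTEX_32` from the kernel UAPI header; `sketch_futex_waitv` and `sketch_waitv_mismatch_errno()` are hypothetical names, and the function reports ENOSYS when the build headers do not define `__NR_futex_waitv`.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

#define SKETCH_FUTEX_32 2	/* assumed to equal FUTEX_32 */

/* Assumed to mirror struct futex_waitv from <linux/futex.h>. */
struct sketch_futex_waitv {
	uint64_t val;
	uint64_t uaddr;
	uint32_t flags;
	uint32_t reserved;
};

/* Wait on one futex whose expected value is wrong; return the errno seen. */
int sketch_waitv_mismatch_errno(void)
{
#ifdef __NR_futex_waitv
	uint32_t word = 0;
	struct sketch_futex_waitv w = {
		.val = word + 1,		/* deliberately not the current value */
		.uaddr = (uintptr_t)&word,
		.flags = SKETCH_FUTEX_32,
		.reserved = 0,
	};
	/* No timeout: the value mismatch makes the call return at once. */
	long res = syscall(__NR_futex_waitv, &w, 1UL, 0UL, NULL, 0);

	assert(res == -1);
	return errno;
#else
	return ENOSYS;
#endif
}
```

On a kernel that provides the syscall the expected result is EAGAIN (the same value as EWOULDBLOCK on Linux); older kernels or restricted sandboxes may report ENOSYS or EPERM instead.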