From 79e902382637a2f421b7f295dcf9934d80d84d7d Mon Sep 17 00:00:00 2001 From: Juri Lelli Date: Fri, 9 Feb 2018 17:01:14 +0100 Subject: [PATCH 1/5] Documentation/locking/mutex-design: Update to reflect latest changes Commit 3ca0ff571b09 ("locking/mutex: Rework mutex::owner") reworked the basic mutex implementation to deal with several problems. Documentation was however left unchanged and became stale. Update mutex-design.txt to reflect changes introduced by the above commit. Signed-off-by: Juri Lelli Cc: Andrew Morton Cc: Davidlohr Bueso Cc: Jonathan Corbet Cc: Linus Torvalds Cc: Paul E. McKenney Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: linux-doc@vger.kernel.org Link: http://lkml.kernel.org/r/20180209160114.19980-1-juri.lelli@redhat.com [ Small readability tweaks to the text. ] Signed-off-by: Ingo Molnar --- Documentation/locking/mutex-design.txt | 49 +++++++++----------------- 1 file changed, 17 insertions(+), 32 deletions(-) diff --git a/Documentation/locking/mutex-design.txt b/Documentation/locking/mutex-design.txt index 60c482df1a38..818aca19612f 100644 --- a/Documentation/locking/mutex-design.txt +++ b/Documentation/locking/mutex-design.txt @@ -21,37 +21,23 @@ Implementation -------------- Mutexes are represented by 'struct mutex', defined in include/linux/mutex.h -and implemented in kernel/locking/mutex.c. These locks use a three -state atomic counter (->count) to represent the different possible -transitions that can occur during the lifetime of a lock: - - 1: unlocked - 0: locked, no waiters - negative: locked, with potential waiters - -In its most basic form it also includes a wait-queue and a spinlock -that serializes access to it. CONFIG_SMP systems can also include -a pointer to the lock task owner (->owner) as well as a spinner MCS -lock (->osq), both described below in (ii). +and implemented in kernel/locking/mutex.c. These locks use an atomic variable +(->owner) to keep track of the lock state during its lifetime. Field owner +actually contains 'struct task_struct *' to the current lock owner and it is +therefore NULL if not currently owned. Since task_struct pointers are aligned +at at least L1_CACHE_BYTES, low bits (3) are used to store extra state (e.g., +if waiter list is non-empty). In its most basic form it also includes a +wait-queue and a spinlock that serializes access to it. Furthermore, +CONFIG_MUTEX_SPIN_ON_OWNER=y systems use a spinner MCS lock (->osq), described +below in (ii). When acquiring a mutex, there are three possible paths that can be taken, depending on the state of the lock: -(i) fastpath: tries to atomically acquire the lock by decrementing the - counter. If it was already taken by another task it goes to the next - possible path. This logic is architecture specific. On x86-64, the - locking fastpath is 2 instructions: - - 0000000000000e10 : - e21: f0 ff 0b lock decl (%rbx) - e24: 79 08 jns e2e - - the unlocking fastpath is equally tight: - - 0000000000000bc0 : - bc8: f0 ff 07 lock incl (%rdi) - bcb: 7f 0a jg bd7 - +(i) fastpath: tries to atomically acquire the lock by cmpxchg()ing the owner with + the current task. This only works in the uncontended case (cmpxchg() checks + against 0UL, so all 3 state bits above have to be 0). If the lock is + contended it goes to the next possible path. (ii) midpath: aka optimistic spinning, tries to spin for acquisition while the lock owner is running and there are no other tasks ready @@ -143,11 +129,10 @@ Test if the mutex is taken: Disadvantages ------------- -Unlike its original design and purpose, 'struct mutex' is larger than -most locks in the kernel. E.g: on x86-64 it is 40 bytes, almost twice -as large as 'struct semaphore' (24 bytes) and tied, along with rwsems, -for the largest lock in the kernel. Larger structure sizes mean more -CPU cache and memory footprint. +Unlike its original design and purpose, 'struct mutex' is among the largest +locks in the kernel. E.g: on x86-64 it is 32 bytes, where 'struct semaphore' +is 24 bytes and rw_semaphore is 40 bytes. Larger structure sizes mean more CPU +cache and memory footprint. When to use mutexes ------------------- From 95bcade33a8af38755c9b0636e36a36ad3789fe6 Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Tue, 13 Feb 2018 13:22:56 +0000 Subject: [PATCH 2/5] locking/qspinlock: Ensure node is initialised before updating prev->next When a locker ends up queuing on the qspinlock locking slowpath, we initialise the relevant mcs node and publish it indirectly by updating the tail portion of the lock word using xchg_tail. If we find that there was a pre-existing locker in the queue, we subsequently update their ->next field to point at our node so that we are notified when it's our turn to take the lock. This can be roughly illustrated as follows: /* Initialise the fields in node and encode a pointer to node in tail */ tail = initialise_node(node); /* * Exchange tail into the lockword using an atomic read-modify-write * operation with release semantics */ old = xchg_tail(lock, tail); /* If there was a pre-existing waiter ... */ if (old & _Q_TAIL_MASK) { prev = decode_tail(old); smp_read_barrier_depends(); /* ... then update their ->next field to point to node. WRITE_ONCE(prev->next, node); } The conditional update of prev->next therefore relies on the address dependency from the result of xchg_tail ensuring order against the prior initialisation of node. However, since the release semantics of the xchg_tail operation apply only to the write portion of the RmW, then this ordering is not guaranteed and it is possible for the CPU to return old before the writes to node have been published, consequently allowing us to point prev->next to an uninitialised node. This patch fixes the problem by making the update of prev->next a RELEASE operation, which also removes the reliance on dependency ordering. Signed-off-by: Will Deacon Acked-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/1518528177-19169-2-git-send-email-will.deacon@arm.com Signed-off-by: Ingo Molnar --- kernel/locking/qspinlock.c | 15 ++++++++------- 1 file changed, 8 insertions(+), 7 deletions(-) diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index 38ece035039e..348c8cec1042 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -408,14 +408,15 @@ queue: */ if (old & _Q_TAIL_MASK) { prev = decode_tail(old); - /* - * The above xchg_tail() is also a load of @lock which - * generates, through decode_tail(), a pointer. The address - * dependency matches the RELEASE of xchg_tail() such that - * the subsequent access to @prev happens after. - */ - WRITE_ONCE(prev->next, node); + /* + * We must ensure that the stores to @node are observed before + * the write to prev->next. The address dependency from + * xchg_tail is not sufficient to ensure this because the read + * component of xchg_tail is unordered with respect to the + * initialisation of @node. + */ + smp_store_release(&prev->next, node); pv_wait_node(node, prev); arch_mcs_spin_lock_contended(&node->locked); From 11dc13224c975efcec96647a4768a6f1bb7a19a8 Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Tue, 13 Feb 2018 13:22:57 +0000 Subject: [PATCH 3/5] locking/qspinlock: Ensure node->count is updated before initialising node When queuing on the qspinlock, the count field for the current CPU's head node is incremented. This needn't be atomic because locking in e.g. IRQ context is balanced and so an IRQ will return with node->count as it found it. However, the compiler could in theory reorder the initialisation of node[idx] before the increment of the head node->count, causing an IRQ to overwrite the initialised node and potentially corrupt the lock state. Avoid the potential for this harmful compiler reordering by placing a barrier() between the increment of the head node->count and the subsequent node initialisation. Signed-off-by: Will Deacon Acked-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/1518528177-19169-3-git-send-email-will.deacon@arm.com Signed-off-by: Ingo Molnar --- kernel/locking/qspinlock.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index 348c8cec1042..d880296245c5 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -379,6 +379,14 @@ queue: tail = encode_tail(smp_processor_id(), idx); node += idx; + + /* + * Ensure that we increment the head node->count before initialising + * the actual node. If the compiler is kind enough to reorder these + * stores, then an IRQ could overwrite our assignments. + */ + barrier(); + node->locked = 0; node->next = NULL; pv_init_node(node); From 61e02392d3c7ecac1f91c0a90a8043d67e081846 Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Tue, 13 Feb 2018 13:30:19 +0000 Subject: [PATCH 4/5] locking/atomic/bitops: Document and clarify ordering semantics for failed test_and_{}_bit() A test_and_{}_bit() operation fails if the value of the bit is such that the modification does not take place. For example, if test_and_set_bit() returns 1. In these cases, follow the behaviour of cmpxchg and allow the operation to be unordered. This also applies to test_and_set_bit_lock() if the lock is found to be be taken already. Signed-off-by: Will Deacon Acked-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Paul E. McKenney Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/1518528619-20049-1-git-send-email-will.deacon@arm.com Signed-off-by: Ingo Molnar --- Documentation/atomic_bitops.txt | 7 ++++++- include/asm-generic/bitops/lock.h | 3 ++- 2 files changed, 8 insertions(+), 2 deletions(-) diff --git a/Documentation/atomic_bitops.txt b/Documentation/atomic_bitops.txt index 5550bfdcce5f..be70b32c95d9 100644 --- a/Documentation/atomic_bitops.txt +++ b/Documentation/atomic_bitops.txt @@ -58,7 +58,12 @@ Like with atomic_t, the rule of thumb is: - RMW operations that have a return value are fully ordered. -Except for test_and_set_bit_lock() which has ACQUIRE semantics and + - RMW operations that are conditional are unordered on FAILURE, + otherwise the above rules apply. In the case of test_and_{}_bit() operations, + if the bit in memory is unchanged by the operation then it is deemed to have + failed. + +Except for a successful test_and_set_bit_lock() which has ACQUIRE semantics and clear_bit_unlock() which has RELEASE semantics. Since a platform only has a single means of achieving atomic operations diff --git a/include/asm-generic/bitops/lock.h b/include/asm-generic/bitops/lock.h index bc397573c43a..67ab280ad134 100644 --- a/include/asm-generic/bitops/lock.h +++ b/include/asm-generic/bitops/lock.h @@ -7,7 +7,8 @@ * @nr: Bit to set * @addr: Address to count from * - * This operation is atomic and provides acquire barrier semantics. + * This operation is atomic and provides acquire barrier semantics if + * the returned value is 0. * It can be used to implement bit locks. */ #define test_and_set_bit_lock(nr, addr) test_and_set_bit(nr, addr) From 2dd6fd2e999774041397f2a7da2e1d30b3a27c3a Mon Sep 17 00:00:00 2001 From: Tycho Andersen Date: Thu, 1 Feb 2018 12:41:19 +0100 Subject: [PATCH 5/5] locking/semaphore: Update the file path in documentation While reading this header I noticed that the locking stuff has moved to kernel/locking/*, so update the path in semaphore.h to point to that. Signed-off-by: Tycho Andersen Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/20180201114119.1090-1-tycho@tycho.ws Signed-off-by: Ingo Molnar --- include/linux/semaphore.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/semaphore.h b/include/linux/semaphore.h index dc368b8ce215..11c86fbfeb98 100644 --- a/include/linux/semaphore.h +++ b/include/linux/semaphore.h @@ -4,7 +4,7 @@ * * Distributed under the terms of the GNU GPL, version 2 * - * Please see kernel/semaphore.c for documentation of these functions + * Please see kernel/locking/semaphore.c for documentation of these functions */ #ifndef __LINUX_SEMAPHORE_H #define __LINUX_SEMAPHORE_H