X-Git-Url: http://ftp.safe.ca/?a=blobdiff_plain;f=Documentation%2Fmemory-barriers.txt;h=631ad2f1b229c1d020a9a8b746d8f6cbe72f4531;hb=43c9c59185eec7caaff6e9dd8d4c93a4d9836a86;hp=c61d8b876fdbcba484a3ad149ac3d057dfb76bba;hpb=79afecfaabbe42e2a8a7e28880517f1721f2f3a7;p=safe%2Fjmp%2Flinux-2.6 diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt index c61d8b8..631ad2f 100644 --- a/Documentation/memory-barriers.txt +++ b/Documentation/memory-barriers.txt @@ -3,6 +3,7 @@ ============================ By: David Howells + Paul E. McKenney Contents: @@ -19,17 +20,19 @@ Contents: - Control dependencies. - SMP barrier pairing. - Examples of memory barrier sequences. + - Read memory barriers vs load speculation. (*) Explicit kernel barriers. - Compiler barrier. - - The CPU memory barriers. + - CPU memory barriers. - MMIO write barrier. (*) Implicit kernel memory barriers. - Locking functions. - Interrupt disabling functions. + - Sleep and wake-up functions. - Miscellaneous functions. (*) Inter-CPU locking barrier effects. @@ -58,6 +61,10 @@ Contents: - And then there's the Alpha. + (*) Example uses. + + - Circular buffers. + (*) References. @@ -211,7 +218,7 @@ There are some minimal guarantees that may be expected of a CPU: STORE *X = c, d = LOAD *X - (Loads and stores overlap if they are targetted at overlapping pieces of + (Loads and stores overlap if they are targeted at overlapping pieces of memory). And there are a number of things that _must_ or _must_not_ be assumed: @@ -248,7 +255,7 @@ And there are a number of things that _must_ or _must_not_ be assumed: we may get either of: STORE *A = X; Y = LOAD *A; - STORE *A = Y; + STORE *A = Y = X; ========================= @@ -261,9 +268,14 @@ What is required is some way of intervening to instruct the compiler and the CPU to restrict the order. Memory barriers are such interventions. They impose a perceived partial -ordering between the memory operations specified on either side of the barrier. -They request that the sequence of memory events generated appears to other -parts of the system as if the barrier is effective on that CPU. +ordering over the memory operations on either side of the barrier. + +Such enforcement is important because the CPUs and other devices in a system +can use a variety of tricks to improve performance, including reordering, +deferral and combination of memory operations; speculative loads; speculative +branch prediction and various types of caching. Memory barriers are used to +override or suppress these tricks, allowing the code to sanely control the +interaction of multiple CPUs and/or devices. VARIETIES OF MEMORY BARRIER @@ -281,7 +293,7 @@ Memory barriers come in four basic varieties: A write barrier is a partial ordering on stores only; it is not required to have any effect on loads. - A CPU can be viewed as as commiting a sequence of store operations to the + A CPU can be viewed as committing a sequence of store operations to the memory system as time progresses. All stores before a write barrier will occur in the sequence _before_ all the stores after the write barrier. @@ -344,9 +356,12 @@ Memory barriers come in four basic varieties: (4) General memory barriers. - A general memory barrier is a combination of both a read memory barrier - and a write memory barrier. It is a partial ordering over both loads and - stores. + A general memory barrier gives a guarantee that all the LOAD and STORE + operations specified before the barrier will appear to happen before all + the LOAD and STORE operations specified after the barrier with respect to + the other components of the system. + + A general memory barrier is a partial ordering over both loads and stores. General memory barriers imply both read and write memory barriers, and so can substitute for either. @@ -409,7 +424,7 @@ There are certain things that the Linux kernel memory barriers do not guarantee: indirect effect will be the order in which the second CPU sees the effects of the first CPU's accesses occur, but see the next point: - (*) There is no guarantee that the a CPU will see the correct order of effects + (*) There is no guarantee that a CPU will see the correct order of effects from a second CPU's accesses, even _if_ the second CPU uses a memory barrier, unless the first CPU _also_ uses a matching memory barrier (see the subsection on "SMP Barrier Pairing"). @@ -421,8 +436,8 @@ There are certain things that the Linux kernel memory barriers do not guarantee: [*] For information on bus mastering DMA and coherency please read: - Documentation/pci.txt - Documentation/DMA-mapping.txt + Documentation/PCI/pci.txt + Documentation/PCI/PCI-DMA-mapping.txt Documentation/DMA-API.txt @@ -448,7 +463,7 @@ sequence, Q must be either &A or &B, and that: (Q == &A) implies (D == 1) (Q == &B) implies (D == 4) -But! CPU 2's perception of P may be updated _before_ its perception of B, thus +But! CPU 2's perception of P may be updated _before_ its perception of B, thus leading to the following situation: (Q == &B) and (D == 2) ???? @@ -457,8 +472,8 @@ Whilst this may seem like a failure of coherency or causality maintenance, it isn't, and this behaviour can be observed on certain real CPUs (such as the DEC Alpha). -To deal with this, a data dependency barrier must be inserted between the -address load and the data load: +To deal with this, a data dependency barrier or better must be inserted +between the address load and the data load: CPU 1 CPU 2 =============== =============== @@ -480,7 +495,7 @@ lines. The pointer P might be stored in an odd-numbered cache line, and the variable B might be stored in an even-numbered cache line. Then, if the even-numbered bank of the reading CPU's cache is extremely busy while the odd-numbered bank is idle, one can see the new value of the pointer P (&B), -but the old value of the variable B (1). +but the old value of the variable B (2). Another example of where data dependency barriers might by required is where a @@ -546,9 +561,9 @@ write barrier, though, again, a general barrier is viable: =============== =============== a = 1; - b = 2; x = a; + b = 2; x = b; - y = b; + y = a; Or: @@ -563,11 +578,23 @@ Or: Basically, the read barrier always has to be there, even though it can be of the "weaker" type. +[!] Note that the stores before the write barrier would normally be expected to +match the loads after the read barrier or the data dependency barrier, and vice +versa: + + CPU 1 CPU 2 + =============== =============== + a = 1; }---- --->{ v = c + b = 2; } \ / { w = d + \ + c = 3; } / \ { x = a; + d = 4; }---- --->{ y = b; + EXAMPLES OF MEMORY BARRIER SEQUENCES ------------------------------------ -Firstly, write barriers act as a partial orderings on store operations. +Firstly, write barriers act as partial orderings on store operations. Consider the following sequence of events: CPU 1 @@ -581,31 +608,31 @@ Consider the following sequence of events: This sequence of events is committed to the memory coherence system in an order that the rest of the system might perceive as the unordered set of { STORE A, -STORE B, STORE C } all occuring before the unordered set of { STORE D, STORE E +STORE B, STORE C } all occurring before the unordered set of { STORE D, STORE E }: +-------+ : : | | +------+ | |------>| C=3 | } /\ - | | : +------+ }----- \ -----> Events perceptible - | | : | A=1 | } \/ to rest of system + | | : +------+ }----- \ -----> Events perceptible to + | | : | A=1 | } \/ the rest of the system | | : +------+ } | CPU 1 | : | B=2 | } | | +------+ } | | wwwwwwwwwwwwwwww } <--- At this point the write barrier | | +------+ } requires all stores prior to the | | : | E=5 | } barrier to be committed before - | | : +------+ } further stores may be take place. + | | : +------+ } further stores may take place | |------>| D=4 | } | | +------+ +-------+ : : | - | Sequence in which stores committed to memory system - | by CPU 1 + | Sequence in which stores are committed to the + | memory system by CPU 1 V -Secondly, data dependency barriers act as a partial orderings on data-dependent +Secondly, data dependency barriers act as partial orderings on data-dependent loads. Consider the following sequence of events: CPU 1 CPU 2 @@ -649,7 +676,7 @@ effectively random order, despite the write barrier issued by CPU 1: In the above example, CPU 2 perceives that B is 7, despite the load of *C -(which would be B) coming after the the LOAD of C. +(which would be B) coming after the LOAD of C. If, however, a data dependency barrier were to be placed between the load of C and the load of *C (ie: B) on CPU 2: @@ -683,14 +710,12 @@ then the following will occur: | : : | | | : : | CPU 2 | | +-------+ | | - \ | X->9 |------>| | - \ +-------+ | | - ----->| B->2 | | | - +-------+ | | - Makes sure all effects ---> ddddddddddddddddd | | - prior to the store of C +-------+ | | - are perceptible to | B->2 |------>| | - successive loads +-------+ | | + | | X->9 |------>| | + | +-------+ | | + Makes sure all effects ---> \ ddddddddddddddddd | | + prior to the store of C \ +-------+ | | + are perceptible to ----->| B->2 |------>| | + subsequent loads +-------+ | | : : +-------+ @@ -699,73 +724,239 @@ following sequence of events: CPU 1 CPU 2 ======================= ======================= + { A = 0, B = 9 } STORE A=1 - STORE B=2 - STORE C=3 - STORE D=4 - STORE E=5 - LOAD A + STORE B=2 LOAD B - LOAD C - LOAD D - LOAD E + LOAD A Without intervention, CPU 2 may then choose to perceive the events on CPU 1 in some effectively random order, despite the write barrier issued by CPU 1: - +-------+ : : - | | +------+ - | |------>| C=3 | } - | | : +------+ } - | | : | A=1 | } - | | : +------+ } - | CPU 1 | : | B=2 | }--- - | | +------+ } \ - | | wwwwwwwwwwwww} \ - | | +------+ } \ : : +-------+ - | | : | E=5 | } \ +-------+ | | - | | : +------+ } \ { | C->3 |------>| | - | |------>| D=4 | } \ { +-------+ : | | - | | +------+ \ { | E->5 | : | | - +-------+ : : \ { +-------+ : | | - Transfer -->{ | A->1 | : | CPU 2 | - from CPU 1 { +-------+ : | | - to CPU 2 { | D->4 | : | | - { +-------+ : | | - { | B->2 |------>| | - +-------+ | | - : : +-------+ - - -If, however, a read barrier were to be placed between the load of C and the -load of D on CPU 2, then the partial ordering imposed by CPU 1 will be -perceived correctly by CPU 2. + +-------+ : : : : + | | +------+ +-------+ + | |------>| A=1 |------ --->| A->0 | + | | +------+ \ +-------+ + | CPU 1 | wwwwwwwwwwwwwwww \ --->| B->9 | + | | +------+ | +-------+ + | |------>| B=2 |--- | : : + | | +------+ \ | : : +-------+ + +-------+ : : \ | +-------+ | | + ---------->| B->2 |------>| | + | +-------+ | CPU 2 | + | | A->0 |------>| | + | +-------+ | | + | : : +-------+ + \ : : + \ +-------+ + ---->| A->1 | + +-------+ + : : - +-------+ : : - | | +------+ - | |------>| C=3 | } - | | : +------+ } - | | : | A=1 | }--- - | | : +------+ } \ - | CPU 1 | : | B=2 | } \ - | | +------+ \ - | | wwwwwwwwwwwwwwww \ - | | +------+ \ : : +-------+ - | | : | E=5 | } \ +-------+ | | - | | : +------+ }--- \ { | C->3 |------>| | - | |------>| D=4 | } \ \ { +-------+ : | | - | | +------+ \ -->{ | B->2 | : | | - +-------+ : : \ { +-------+ : | | - \ { | A->1 | : | CPU 2 | - \ +-------+ | | - At this point the read ----> \ rrrrrrrrrrrrrrrrr | | - barrier causes all effects \ +-------+ | | - prior to the storage of C \ { | E->5 | : | | - to be perceptible to CPU 2 -->{ +-------+ : | | - { | D->4 |------>| | - +-------+ | | - : : +-------+ + +If, however, a read barrier were to be placed between the load of B and the +load of A on CPU 2: + + CPU 1 CPU 2 + ======================= ======================= + { A = 0, B = 9 } + STORE A=1 + + STORE B=2 + LOAD B + + LOAD A + +then the partial ordering imposed by CPU 1 will be perceived correctly by CPU +2: + + +-------+ : : : : + | | +------+ +-------+ + | |------>| A=1 |------ --->| A->0 | + | | +------+ \ +-------+ + | CPU 1 | wwwwwwwwwwwwwwww \ --->| B->9 | + | | +------+ | +-------+ + | |------>| B=2 |--- | : : + | | +------+ \ | : : +-------+ + +-------+ : : \ | +-------+ | | + ---------->| B->2 |------>| | + | +-------+ | CPU 2 | + | : : | | + | : : | | + At this point the read ----> \ rrrrrrrrrrrrrrrrr | | + barrier causes all effects \ +-------+ | | + prior to the storage of B ---->| A->1 |------>| | + to be perceptible to CPU 2 +-------+ | | + : : +-------+ + + +To illustrate this more completely, consider what could happen if the code +contained a load of A either side of the read barrier: + + CPU 1 CPU 2 + ======================= ======================= + { A = 0, B = 9 } + STORE A=1 + + STORE B=2 + LOAD B + LOAD A [first load of A] + + LOAD A [second load of A] + +Even though the two loads of A both occur after the load of B, they may both +come up with different values: + + +-------+ : : : : + | | +------+ +-------+ + | |------>| A=1 |------ --->| A->0 | + | | +------+ \ +-------+ + | CPU 1 | wwwwwwwwwwwwwwww \ --->| B->9 | + | | +------+ | +-------+ + | |------>| B=2 |--- | : : + | | +------+ \ | : : +-------+ + +-------+ : : \ | +-------+ | | + ---------->| B->2 |------>| | + | +-------+ | CPU 2 | + | : : | | + | : : | | + | +-------+ | | + | | A->0 |------>| 1st | + | +-------+ | | + At this point the read ----> \ rrrrrrrrrrrrrrrrr | | + barrier causes all effects \ +-------+ | | + prior to the storage of B ---->| A->1 |------>| 2nd | + to be perceptible to CPU 2 +-------+ | | + : : +-------+ + + +But it may be that the update to A from CPU 1 becomes perceptible to CPU 2 +before the read barrier completes anyway: + + +-------+ : : : : + | | +------+ +-------+ + | |------>| A=1 |------ --->| A->0 | + | | +------+ \ +-------+ + | CPU 1 | wwwwwwwwwwwwwwww \ --->| B->9 | + | | +------+ | +-------+ + | |------>| B=2 |--- | : : + | | +------+ \ | : : +-------+ + +-------+ : : \ | +-------+ | | + ---------->| B->2 |------>| | + | +-------+ | CPU 2 | + | : : | | + \ : : | | + \ +-------+ | | + ---->| A->1 |------>| 1st | + +-------+ | | + rrrrrrrrrrrrrrrrr | | + +-------+ | | + | A->1 |------>| 2nd | + +-------+ | | + : : +-------+ + + +The guarantee is that the second load will always come up with A == 1 if the +load of B came up with B == 2. No such guarantee exists for the first load of +A; that may come up with either A == 0 or A == 1. + + +READ MEMORY BARRIERS VS LOAD SPECULATION +---------------------------------------- + +Many CPUs speculate with loads: that is they see that they will need to load an +item from memory, and they find a time where they're not using the bus for any +other loads, and so do the load in advance - even though they haven't actually +got to that point in the instruction execution flow yet. This permits the +actual load instruction to potentially complete immediately because the CPU +already has the value to hand. + +It may turn out that the CPU didn't actually need the value - perhaps because a +branch circumvented the load - in which case it can discard the value or just +cache it for later use. + +Consider: + + CPU 1 CPU 2 + ======================= ======================= + LOAD B + DIVIDE } Divide instructions generally + DIVIDE } take a long time to perform + LOAD A + +Which might appear as this: + + : : +-------+ + +-------+ | | + --->| B->2 |------>| | + +-------+ | CPU 2 | + : :DIVIDE | | + +-------+ | | + The CPU being busy doing a ---> --->| A->0 |~~~~ | | + division speculates on the +-------+ ~ | | + LOAD of A : : ~ | | + : :DIVIDE | | + : : ~ | | + Once the divisions are complete --> : : ~-->| | + the CPU can then perform the : : | | + LOAD with immediate effect : : +-------+ + + +Placing a read barrier or a data dependency barrier just before the second +load: + + CPU 1 CPU 2 + ======================= ======================= + LOAD B + DIVIDE + DIVIDE + + LOAD A + +will force any value speculatively obtained to be reconsidered to an extent +dependent on the type of barrier used. If there was no change made to the +speculated memory location, then the speculated value will just be used: + + : : +-------+ + +-------+ | | + --->| B->2 |------>| | + +-------+ | CPU 2 | + : :DIVIDE | | + +-------+ | | + The CPU being busy doing a ---> --->| A->0 |~~~~ | | + division speculates on the +-------+ ~ | | + LOAD of A : : ~ | | + : :DIVIDE | | + : : ~ | | + : : ~ | | + rrrrrrrrrrrrrrrr~ | | + : : ~ | | + : : ~-->| | + : : | | + : : +-------+ + + +but if there was an update or an invalidation from another CPU pending, then +the speculation will be cancelled and the value reloaded: + + : : +-------+ + +-------+ | | + --->| B->2 |------>| | + +-------+ | CPU 2 | + : :DIVIDE | | + +-------+ | | + The CPU being busy doing a ---> --->| A->0 |~~~~ | | + division speculates on the +-------+ ~ | | + LOAD of A : : ~ | | + : :DIVIDE | | + : : ~ | | + : : ~ | | + rrrrrrrrrrrrrrrrr | | + +-------+ | | + The speculation is discarded ---> --->| A->1 |------>| | + and an updated value is +-------+ | | + retrieved : : +-------+ ======================== @@ -790,7 +981,7 @@ compiler from moving the memory accesses either side of it to the other side: barrier(); -This a general barrier - lesser varieties of compiler barrier do not exist. +This is a general barrier - lesser varieties of compiler barrier do not exist. The compiler barrier has no direct effect on the CPU, which may then reorder things however it wishes. @@ -809,10 +1000,20 @@ The Linux kernel has eight basic CPU memory barriers: DATA DEPENDENCY read_barrier_depends() smp_read_barrier_depends() -All CPU memory barriers unconditionally imply compiler barriers. +All memory barriers except the data dependency barriers imply a compiler +barrier. Data dependencies do not impose any additional compiler ordering. + +Aside: In the case of data dependencies, the compiler would be expected to +issue the loads in the correct order (eg. `a[b]` would have to load the value +of b before loading a[b]), however there is no guarantee in the C specification +that the compiler may not speculate the value of b (eg. is equal to 1) and load +a before b (eg. tmp = a[1]; if (b != 1) tmp = a[b]; ). There is also the +problem of a compiler reloading b after having loaded a[b], thus having a newer +copy of b than a[b]. A consensus has not yet been reached about these problems, +however the ACCESS_ONCE macro is a good place to start looking. SMP memory barriers are reduced to compiler barriers on uniprocessor compiled -systems because it is assumed that a CPU will be appear to be self-consistent, +systems because it is assumed that a CPU will appear to be self-consistent, and will order overlapping accesses correctly with respect to itself. [!] Note that SMP memory barriers _must_ be used to control the ordering of @@ -830,10 +1031,9 @@ CPU from reordering them. There are some more advanced barrier functions: (*) set_mb(var, value) - (*) set_wmb(var, value) - These assign the value to the variable and then insert at least a write - barrier after it, depending on the function. They aren't guaranteed to + This assigns the value to the variable and then inserts a full memory + barrier after it, depending on the function. It isn't guaranteed to insert anything more than a compiler barrier in a UP compilation. @@ -901,7 +1101,7 @@ IMPLICIT KERNEL MEMORY BARRIERS =============================== Some of the other functions in the linux kernel imply memory barriers, amongst -which are locking, scheduling and memory allocation functions. +which are locking and scheduling functions. This specification is a _minimum_ guarantee; any particular architecture may provide more substantial guarantees, but these may not be relied upon outside @@ -962,9 +1162,23 @@ for each construct. These operations all imply certain barriers: Therefore, from (1), (2) and (4) an UNLOCK followed by an unconditional LOCK is equivalent to a full barrier, but a LOCK followed by an UNLOCK is not. -[!] Note: one of the consequence of LOCKs and UNLOCKs being only one-way - barriers is that the effects instructions outside of a critical section may - seep into the inside of the critical section. +[!] Note: one of the consequences of LOCKs and UNLOCKs being only one-way + barriers is that the effects of instructions outside of a critical section + may seep into the inside of the critical section. + +A LOCK followed by an UNLOCK may not be assumed to be full memory barrier +because it is possible for an access preceding the LOCK to happen after the +LOCK, and an access following the UNLOCK to happen before the UNLOCK, and the +two accesses can themselves then cross: + + *A = a; + LOCK + UNLOCK + *B = b; + +may occur as: + + LOCK, STORE *B, STORE *A, UNLOCK Locks and semaphores may not provide any guarantee of ordering on UP compiled systems, and so cannot be counted on in such a situation to actually achieve @@ -1009,6 +1223,132 @@ barriers are required in such a situation, they must be provided from some other means. +SLEEP AND WAKE-UP FUNCTIONS +--------------------------- + +Sleeping and waking on an event flagged in global data can be viewed as an +interaction between two pieces of data: the task state of the task waiting for +the event and the global data used to indicate the event. To make sure that +these appear to happen in the right order, the primitives to begin the process +of going to sleep, and the primitives to initiate a wake up imply certain +barriers. + +Firstly, the sleeper normally follows something like this sequence of events: + + for (;;) { + set_current_state(TASK_UNINTERRUPTIBLE); + if (event_indicated) + break; + schedule(); + } + +A general memory barrier is interpolated automatically by set_current_state() +after it has altered the task state: + + CPU 1 + =============================== + set_current_state(); + set_mb(); + STORE current->state + + LOAD event_indicated + +set_current_state() may be wrapped by: + + prepare_to_wait(); + prepare_to_wait_exclusive(); + +which therefore also imply a general memory barrier after setting the state. +The whole sequence above is available in various canned forms, all of which +interpolate the memory barrier in the right place: + + wait_event(); + wait_event_interruptible(); + wait_event_interruptible_exclusive(); + wait_event_interruptible_timeout(); + wait_event_killable(); + wait_event_timeout(); + wait_on_bit(); + wait_on_bit_lock(); + + +Secondly, code that performs a wake up normally follows something like this: + + event_indicated = 1; + wake_up(&event_wait_queue); + +or: + + event_indicated = 1; + wake_up_process(event_daemon); + +A write memory barrier is implied by wake_up() and co. if and only if they wake +something up. The barrier occurs before the task state is cleared, and so sits +between the STORE to indicate the event and the STORE to set TASK_RUNNING: + + CPU 1 CPU 2 + =============================== =============================== + set_current_state(); STORE event_indicated + set_mb(); wake_up(); + STORE current->state + STORE current->state + LOAD event_indicated + +The available waker functions include: + + complete(); + wake_up(); + wake_up_all(); + wake_up_bit(); + wake_up_interruptible(); + wake_up_interruptible_all(); + wake_up_interruptible_nr(); + wake_up_interruptible_poll(); + wake_up_interruptible_sync(); + wake_up_interruptible_sync_poll(); + wake_up_locked(); + wake_up_locked_poll(); + wake_up_nr(); + wake_up_poll(); + wake_up_process(); + + +[!] Note that the memory barriers implied by the sleeper and the waker do _not_ +order multiple stores before the wake-up with respect to loads of those stored +values after the sleeper has called set_current_state(). For instance, if the +sleeper does: + + set_current_state(TASK_INTERRUPTIBLE); + if (event_indicated) + break; + __set_current_state(TASK_RUNNING); + do_something(my_data); + +and the waker does: + + my_data = value; + event_indicated = 1; + wake_up(&event_wait_queue); + +there's no guarantee that the change to event_indicated will be perceived by +the sleeper as coming after the change to my_data. In such a circumstance, the +code on both sides must interpolate its own memory barriers between the +separate data accesses. Thus the above sleeper ought to do: + + set_current_state(TASK_INTERRUPTIBLE); + if (event_indicated) { + smp_rmb(); + do_something(my_data); + } + +and the waker should do: + + my_data = value; + smp_wmb(); + event_indicated = 1; + wake_up(&event_wait_queue); + + MISCELLANEOUS FUNCTIONS ----------------------- @@ -1016,8 +1356,6 @@ Other functions that imply barriers: (*) schedule() and similar imply full memory barriers. - (*) Memory allocation and release functions imply full memory barriers. - ================================= INTER-CPU LOCKING BARRIER EFFECTS @@ -1043,7 +1381,7 @@ three CPUs; then should the following sequence of events occur: UNLOCK M UNLOCK Q *D = d; *H = h; -Then there is no guarantee as to what order CPU #3 will see the accesses to *A +Then there is no guarantee as to what order CPU 3 will see the accesses to *A through *H occur in, other than the constraints imposed by the separate locks on the separate CPUs. It might, for example, see: @@ -1073,12 +1411,12 @@ However, if the following occurs: UNLOCK M [2] *H = h; -CPU #3 might see: +CPU 3 might see: *E, LOCK M [1], *C, *B, *A, UNLOCK M [1], LOCK M [2], *H, *F, *G, UNLOCK M [2], *D -But assuming CPU #1 gets the lock first, it won't see any of: +But assuming CPU 1 gets the lock first, CPU 3 won't see any of: *B, *C, *D, *F, *G or *H preceding LOCK M [1] *A, *B or *C following UNLOCK M [1] @@ -1131,12 +1469,12 @@ spinlock, for example: mmiowb(); spin_unlock(Q); -this will ensure that the two stores issued on CPU #1 appear at the PCI bridge -before either of the stores issued on CPU #2. +this will ensure that the two stores issued on CPU 1 appear at the PCI bridge +before either of the stores issued on CPU 2. -Furthermore, following a store by a load to the same device obviates the need -for an mmiowb(), because the load forces the store to complete before the load +Furthermore, following a store by a load from the same device obviates the need +for the mmiowb(), because the load forces the store to complete before the load is performed: CPU 1 CPU 2 @@ -1160,14 +1498,14 @@ WHERE ARE MEMORY BARRIERS NEEDED? Under normal operation, memory operation reordering is generally not going to be a problem as a single-threaded linear piece of code will still appear to -work correctly, even if it's in an SMP kernel. There are, however, three +work correctly, even if it's in an SMP kernel. There are, however, four circumstances in which reordering definitely _could_ be a problem: (*) Interprocessor interaction. (*) Atomic operations. - (*) Accessing devices (I/O). + (*) Accessing devices. (*) Interrupts. @@ -1203,7 +1541,7 @@ To wake up a particular waiter, the up_read() or up_write() functions have to: (1) read the next pointer from this waiter's record to know as to where the next waiter record is; - (4) read the pointer to the waiter's task structure; + (2) read the pointer to the waiter's task structure; (3) clear the task pointer to tell the waiter it has been given the semaphore; @@ -1211,7 +1549,7 @@ To wake up a particular waiter, the up_read() or up_write() functions have to: (5) release the reference held on the waiter's task struct. -In otherwords, it has to perform this sequence of events: +In other words, it has to perform this sequence of events: LOAD waiter->list.next; LOAD waiter->task; @@ -1269,9 +1607,8 @@ instruction itself is complete. On a UP system - where this wouldn't be a problem - the smp_mb() is just a compiler barrier, thus making sure the compiler emits the instructions in the -right order without actually intervening in the CPU. Since there there's only -one CPU, that CPU's dependency ordering logic will take care of everything -else. +right order without actually intervening in the CPU. Since there's only one +CPU, that CPU's dependency ordering logic will take care of everything else. ATOMIC OPERATIONS @@ -1284,7 +1621,8 @@ kernel. Any atomic operation that modifies some state in memory and returns information about the state (old or new) implies an SMP-conditional general memory barrier -(smp_mb()) on each side of the actual operation. These include: +(smp_mb()) on each side of the actual operation (with the exception of +explicit lock operations, described later). These include: xchg(); cmpxchg(); @@ -1297,7 +1635,7 @@ about the state (old or new) implies an SMP-conditional general memory barrier atomic_dec_and_test(); atomic_sub_and_test(); atomic_add_negative(); - atomic_add_unless(); + atomic_add_unless(); /* when succeeds (returns 1) */ test_and_set_bit(); test_and_clear_bit(); test_and_change_bit(); @@ -1307,7 +1645,7 @@ operations and adjusting reference counters towards object destruction, and as such the implicit memory barrier effects are necessary. -The following operation are potential problems as they do _not_ imply memory +The following operations are potential problems as they do _not_ imply memory barriers, but might be used for implementing such things as UNLOCK-class operations: @@ -1322,7 +1660,7 @@ With these the appropriate explicit memory barrier should be used if necessary The following also do _not_ imply memory barriers, and so may require explicit memory barriers under some circumstances (smp_mb__before_atomic_dec() for -instance)): +instance): atomic_add(); atomic_sub(); @@ -1341,10 +1679,19 @@ If they're used for constructing a lock of some description, then they probably do need memory barriers as a lock primitive generally has to do things in a specific order. - Basically, each usage case has to be carefully considered as to whether memory barriers are needed or not. +The following operations are special locking primitives: + + test_and_set_bit_lock(); + clear_bit_unlock(); + __clear_bit_unlock(); + +These implement LOCK-class and UNLOCK-class operations. These should be used in +preference to other operations when implementing locking primitives, because +their implementations can be optimised on many architectures. + [!] Note that special memory barrier primitives are available for these situations because on some CPUs the atomic instructions used imply full memory barriers, and so barrier instructions are superfluous in conjunction with them, @@ -1446,11 +1793,11 @@ functions: indeed have special I/O space access cycles and instructions, but many CPUs don't have such a concept. - The PCI bus, amongst others, defines an I/O space concept - which on such - CPUs as i386 and x86_64 cpus readily maps to the CPU's concept of I/O - space. However, it may also mapped as a virtual I/O space in the CPU's - memory map, particularly on those CPUs that don't support alternate - I/O spaces. + The PCI bus, amongst others, defines an I/O space concept which - on such + CPUs as i386 and x86_64 - readily maps to the CPU's concept of I/O + space. However, it may also be mapped as a virtual I/O space in the CPU's + memory map, particularly on those CPUs that don't support alternate I/O + spaces. Accesses to this space may be fully synchronous (as on i386), but intermediary bridges (such as the PCI host bridge) may not fully honour @@ -1469,7 +1816,7 @@ functions: i386 architecture machines, for example, this is controlled by way of the MTRR registers. - Ordinarily, these will be guaranteed to be fully ordered and uncombined,, + Ordinarily, these will be guaranteed to be fully ordered and uncombined, provided they're not accessing a prefetchable device. However, intermediary hardware (such as a PCI bridge) may indulge in @@ -1494,7 +1841,7 @@ functions: (*) ioreadX(), iowriteX() - These will perform as appropriate for the type of access they're actually + These will perform appropriately for the type of access they're actually doing, be it inX()/outX() or readX()/writeX(). @@ -1510,7 +1857,7 @@ of arch-specific code. This means that it must be considered that the CPU will execute its instruction stream in any order it feels like - or even in parallel - provided that if an -instruction in the stream depends on the an earlier instruction, then that +instruction in the stream depends on an earlier instruction, then that earlier instruction must be sufficiently complete[*] before the later instruction may proceed; in other words: provided that the appearance of causality is maintained. @@ -1600,8 +1947,8 @@ eventually become visible on all CPUs, there's no guarantee that they will become apparent in the same order on those other CPUs. -Consider dealing with a system that has pair of CPUs (1 & 2), each of which has -a pair of parallel data caches (CPU 1 has A/B, and CPU 2 has C/D): +Consider dealing with a system that has a pair of CPUs (1 & 2), each of which +has a pair of parallel data caches (CPU 1 has A/B, and CPU 2 has C/D): : : +--------+ @@ -1640,7 +1987,7 @@ Imagine the system has the following properties: (*) the coherency queue is not flushed by normal loads to lines already present in the cache, even though the contents of the queue may - potentially effect those loads. + potentially affect those loads. Imagine, then, that two writes are made on the first CPU, with a write barrier between them to guarantee that they will appear to reach that CPU's caches in @@ -1650,7 +1997,7 @@ the requisite order: =============== =============== ======================================= u == 0, v == 1 and p == &u, q == &u v = 2; - smp_wmb(); Make sure change to v visible before + smp_wmb(); Make sure change to v is visible before change to p v is now in cache A exclusively p = &v; @@ -1658,7 +2005,7 @@ the requisite order: The write memory barrier forces the other CPUs in the system to perceive that the local CPU's caches have apparently been updated in the correct order. But -now imagine that the second CPU that wants to read those values: +now imagine that the second CPU wants to read those values: CPU 1 CPU 2 COMMENT =============== =============== ======================================= @@ -1666,7 +2013,7 @@ now imagine that the second CPU that wants to read those values: q = p; x = *q; -The above pair of reads may then fail to happen in expected order, as the +The above pair of reads may then fail to happen in the expected order, as the cacheline holding p may get updated in one of the second CPU's caches whilst the update to the cacheline holding v is delayed in the other of the second CPU's caches by some other cache event: @@ -1703,7 +2050,7 @@ queue before processing any further requests: smp_wmb(); - p = &b; q = p; + p = &v; q = p; @@ -1720,8 +2067,8 @@ Whilst most CPUs do imply a data dependency barrier on the read when a memory access depends on a read, not all do, so it may not be relied on. Other CPUs may also have split caches, but must coordinate between the various -cachelets for normal memory accesss. The semantics of the Alpha removes the -need for coordination in absence of memory barriers. +cachelets for normal memory accesses. The semantics of the Alpha removes the +need for coordination in the absence of memory barriers. CACHE COHERENCY VS DMA @@ -1736,10 +2083,10 @@ invalidate them as well). In addition, the data DMA'd to RAM by a device may be overwritten by dirty cache lines being written back to RAM from a CPU's cache after the device has -installed its own data, or cache lines simply present in a CPUs cache may -simply obscure the fact that RAM has been updated, until at such time as the -cacheline is discarded from the CPU's cache and reloaded. To deal with this, -the appropriate part of the kernel must invalidate the overlapping bits of the +installed its own data, or cache lines present in the CPU's cache may simply +obscure the fact that RAM has been updated, until at such time as the cacheline +is discarded from the CPU's cache and reloaded. To deal with this, the +appropriate part of the kernel must invalidate the overlapping bits of the cache on each CPU. See Documentation/cachetlb.txt for more information on cache management. @@ -1749,7 +2096,7 @@ CACHE COHERENCY VS MMIO ----------------------- Memory mapped I/O usually takes place through memory locations that are part of -a window in the CPU's memory space that have different properties assigned than +a window in the CPU's memory space that has different properties assigned than the usual RAM directed window. Amongst these properties is usually the fact that such accesses bypass the @@ -1765,7 +2112,7 @@ THE THINGS CPUS GET UP TO ========================= A programmer might take it for granted that the CPU will perform memory -operations in exactly the order specified, so that if a CPU is, for example, +operations in exactly the order specified, so that if the CPU is, for example, given the following piece of code to execute: a = *A; @@ -1774,7 +2121,7 @@ given the following piece of code to execute: d = *D; *E = e; -They would then expect that the CPU will complete the memory operation for each +they would then expect that the CPU will complete the memory operation for each instruction before moving on to the next one, leading to a definite sequence of operations as seen by external observers in the system: @@ -1791,8 +2138,8 @@ assumption doesn't hold because: (*) loads may be done speculatively, and the result discarded should it prove to have been unnecessary; - (*) loads may be done speculatively, leading to the result having being - fetched at the wrong time in the expected sequence of events; + (*) loads may be done speculatively, leading to the result having been fetched + at the wrong time in the expected sequence of events; (*) the order of the memory accesses may be rearranged to promote better use of the CPU buses and caches; @@ -1874,16 +2221,31 @@ AND THEN THERE'S THE ALPHA The DEC Alpha CPU is one of the most relaxed CPUs there is. Not only that, some versions of the Alpha CPU have a split data cache, permitting them to have -two semantically related cache lines updating at separate times. This is where +two semantically-related cache lines updated at separate times. This is where the data dependency barrier really becomes necessary as this synchronises both caches with the memory coherence system, thus making it seem like pointer changes vs new data occur in the right order. -The Alpha defines the Linux's kernel's memory barrier model. +The Alpha defines the Linux kernel's memory barrier model. See the subsection on "Cache Coherency" above. +============ +EXAMPLE USES +============ + +CIRCULAR BUFFERS +---------------- + +Memory barriers can be used to implement circular buffering without the need +of a lock to serialise the producer with the consumer. See: + + Documentation/circular-buffers.txt + +for details. + + ========== REFERENCES ==========