memory_barriers.txt
Barret Rhoden

1. Overview
2. General Rules
3. Use in the Code Base
4. Memory Barriers and Locking
5. Other Stuff

1. Overview
====================
Memory barriers exist to make sure the compiler and the CPU do what we intend.
The compiler memory barrier (cmb()) (called an optimization barrier in Linux)
prevents the compiler from reordering operations.  However, CPUs can also
reorder reads and writes, in an architecture-dependent manner.  In most places
with shared memory synchronization, you'll need some form of memory barriers.

These barriers apply to 'unrelated' reads and writes.  The compiler or the CPU
cannot detect any relationship between them, so it believes it is safe to
reorder them.  The problem arises in that we attach some meaning to them,
often in the form of signalling other cores.

CPU memory barriers only apply when talking to different cores or hardware
devices.  They do not matter when dealing with your own core (perhaps between
a uthread and vcore context, running on the same core).  cmb()s still matter,
even when the synchronizing code runs on the same core.  See Section 3 for
more details.

2. General Rules
====================
2.1: Types of Memory Barriers
---------------------
For CPU memory barriers, we have 5 types.
- rmb() no reordering of reads with future reads
- wmb() no reordering of writes with future writes
- wrmb() no reordering of writes with future reads
- rwmb() no reordering of reads with future writes
- mb() no reordering of reads or writes with future reads or writes

All 5 of these have a cmb() built in (a "memory" clobber).

Part of the reason for the distinction between wrmb/rwmb and the full mb is
that on many machines (x86), rwmb() is a noop (for us).

These barriers apply to 'normal' reads and writes; they do not cover
streaming/SSE instructions or string ops (on x86), and they do not cover
talking with hardware devices.  For memory barriers around those types of
operations, use the _f variety (force), e.g. rmb() -> rmb_f().
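
As a quick example of picking a barrier, consider a flag-and-data handoff
between two cores.  The variables below are made up for illustration:

(global) int data = 0, ready = 0;

Core 0 (producer):
data = 42;
wmb();        /* don't let the ready write pass the data write */
ready = 1;

Core 1 (consumer):
while (ready == 0) ;
rmb();        /* don't let the data read pass the ready read */
assert(data == 42);

Without the wmb(), core 1 could see ready == 1 while data is still stale;
without the rmb(), core 1 could read data before it read ready.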

2.2: Atomics
---------------------
Most atomic operations, such as atomic_inc(), provide some form of memory
barrier protection.  Specifically, all read-modify-write (RMW) atomic ops act
as a CPU memory barrier (like an mb()), but do *not* provide a cmb().  They
only provide a cmb() on the variables they apply to (i.e., variables in the
clobber list).
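
For instance (a sketch; 'item' and 'count' are made-up variables), the RMW
orders the CPU, but the compiler is still free to move unrelated accesses
across it unless you add a cmb():

item->ready = TRUE;     /* unrelated to 'count', as far as the compiler knows */
cmb();                  /* needed: the atomic only clobbers 'count' */
atomic_inc(&count);     /* LOCKed RMW, acts as a CPU mb() */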

I considered making all atomics clobber "memory" (like the cmb()), but
__sync_fetch_and_add() and friends do not do this by default, and it would
also mean that any use of atomics (even when we don't require a cmb()) then
provides a cmb().

Also note that not all atomic operations are RMW.  atomic_set(), _init(), and
_read() do not enforce a memory barrier in the CPU.  If in doubt, look for the
LOCK in the assembly (except for xchg, which is implicitly locked even without
the prefix).  We're taking advantage of the LOCK the atomics provide to
serialize and synchronize our memory.

In a lot of places, even if we have an atomic I'll still put in the expected
mb (e.g., a rmb()), especially if it clarifies the code.  When I rely on the
atomic's LOCK, I'll make a note of it (such as in spinlock code).

Finally, note that the atomic RMWs handle the mb_f()s just like they handle
the regular memory barriers.  The LOCK prefix does quite a bit.

These rules are a bit x86 specific, so for now other architectures will need
to implement their atomics such that they provide the same effects.

2.3: Locking
---------------------
If you access shared memory variables only when inside locks, then you do not
need to worry about memory barriers.  The details are sorted out in the
locking code.
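
For example (assuming spin_lock()/spin_unlock() as the lock API; 'count_lock'
and 'count' are made up), the lock implementation supplies the barriers:

spin_lock(&count_lock);
count++;                  /* ordering is handled by the locking code */
spin_unlock(&count_lock);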

3. Use in the Code Base
====================
Figuring out how / when / if to use memory barriers is relatively easy.
- First, notice when you are using shared memory to synchronize.  Any time you
  are using a global variable or working on structs that someone else might
  look at concurrently, and you aren't using locks or another vetted tool
  (like krefs), you need to worry about this.
- Second, determine what reads and writes you are doing.
- Third, determine who you are talking to.

If you're talking to other cores or devices, you need CPU mbs.  If not, a
cmb() suffices.  Based on the types of reads and writes you are doing, just
pick one of the 5 memory barriers.

3.1: What's a Read or Write?
---------------------
When writing code that synchronizes with other threads via shared memory, we
have a variety of patterns.  Most infamous is the "signal, then check if the
receiver is still listening", which is the critical part of the "check,
signal, check again" pattern.  For examples, look at things like
'notif_pending' and when we check VC_CAN_RCV_MSG in event.c.
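
Heavily simplified, the shape of that pattern is a write followed by a read of
a different location, which is exactly what wrmb() orders.  This is not the
actual event.c code; 'vcpd', 'listening', and send_fallback_event() are
stand-ins:

vcpd->notif_pending = TRUE;     /* the signal: a write */
wrmb();                         /* don't let the check pass the signal */
if (!listening)                 /* the check: a read of another location */
	send_fallback_event();  /* hypothetical recovery path */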

In these examples, "write" and "read" include things such as posting events or
checking flags (which ultimately involve writes and reads).  You need to be
aware that some functions, such as TAILQ_REMOVE, are writes, even though they
are not written as *x = 3;.  Whatever your code is, you should be able to
point out the critical variables and their interleavings.  You should see how
a CPU reordering would break your algorithm just like a poorly timed interrupt
or other concurrent interleaving.

When looking at a function that we consider a signal/write, you need to know
what it does.  It might handle its memory barriers internally (protecting its
own memory operations).  However, it might not.  In general, I err on the side
of extra mbs, or at the very least leave a comment about what I am doing and
what sort of barriers I desire from the code.

3.2: Who Are We Talking To?
---------------------
CPU memory barriers are necessary when synchronizing/talking with remote cores
or devices, but not when "talking" with your own core.  For instance, if you
issue two writes, then read them, you will see both writes (reads may not be
reordered with older writes to the same location on a single processor, and
your reads get served out of the write buffer).  Note that the read can
pass/happen before the write, but the CPU gives you the correct value that the
write gave you (store-buffer forwarding).  Other cores may not see both of
them due to reordering magic.  Put another way, since those remote processors
didn't do the initial writing, the rule about reading the same location
doesn't apply to them.

Given this, when finding spots in the code that may require a mb(), I think
about signalling a concurrent viewer on a different core.  A classic example
is when we signal to say "process an item".  Say we were on one core, filled
out the struct, and then signalled; if we then started reading from that same
core, we would see our own writes (you see writes you previously wrote), but
someone else on another core may see the signal before the filled-out struct.

There is still a distinction between the compiler reordering code and the
processor reordering code.  Imagine the case of "filling a struct, then
posting the pointer to the struct".  If the compiler reorders, the pointer may
be posted before the struct is filled, and an interrupt may come in.  The
interrupt handler may look at that structure and see invalid data.  This has
nothing to do with CPU reordering; the compiler simply messed with you.  Note
that this only matters if we care about a concurrent interleaving (an
interrupt handler with a kthread, for instance), and we ought to be aware that
there is some shared state being mucked with.
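
A minimal sketch of that case ('work' and 'work_ptr' are made up; the handler
that reads work_ptr runs on this same core, so a cmb() is all we need):

work.x = 1;
work.y = 2;          /* fill out the struct */
cmb();               /* keep the compiler from posting the pointer early */
work_ptr = &work;    /* an interrupt handler may read this at any time */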

For a more complicated example, consider DONT_MIGRATE and reading vcoreid.
Logically, we want to make sure the vcoreid read doesn't come before the flag
write (since uthread code now thinks it is on a particular vcore).  This
write-then-read would normally require a wrmb(), but is that overkill?
Clearly, we need the compiler to keep these in order, so we need a cmb() at
least.  Here's the code that the processor will get:
orl    $0x1,0x254(%edi)      (write DONT_MIGRATE)
mov    $0xfffffff0,%ebp      (getting ready with the TLS)
mov    %gs:0x0(%ebp),%esi    (reading the vcoreid from TLS)
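
At the C level, that roughly corresponds to the following (the names here are
illustrative, not the exact uthread code):

uthread->flags |= DONT_MIGRATE;    /* the flag write */
cmb();                             /* compiler: keep the write before the read */
vcoreid = __vcoreid;               /* the vcoreid read from TLS */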

Do we need a wrmb() here?  Any remote code might see the write after the
vcoreid read, but the read is a bit different than normal ones.  We aren't
reading global memory, and we aren't trying to synchronize with another core.
All that matters is that if the thread running this code did the vcoreid read,
then whoever reads the flag sees the write.

The 'whoever' is not concurrently running code - it is 2LS code that either
runs on the vcore due to an IPI/notification, or it is 2LS code running
remotely that responded to a preemption of that vcore.  Both of these cases
require an IPI.  AFAIK, interrupts serialize memory operations, so whatever
writes were issued before the interrupt will have hit memory (or cache) before
we even do things like write out the trapframe of the thread.  If this isn't
true, then the synchronization we do when writing out the trapframe (before
allowing a remote core to try and recover the preempted uthread) will handle
the DONT_MIGRATE write.

Anyway, the point is that remote code will look at it, but only when told to
look.  That "telling" is the write, which happens after the
synchronizing/serializing events of the IPI path.

4. Memory Barriers and Locking
====================
The implementation of locks requires memory barriers (both compiler and CPU).
Regular users of locks do not need to worry about this.  Lock implementers do.

We need to consider reorderings of reads and writes from before and after the
lock/unlock write.  In these next sections, the reads and writes I talk about
are from a given thread/CPU.  Protected reads/writes are those that happen
while the lock is held.  When I say you need a wmb(), you could get by with a
cmb() and an atomic-RMW op: just so long as you have the cmb() and the
appropriate CPU mb() at a minimum.

4.1: Locking
---------------------
- Don't care about our reads or writes before the lock happening after the
  lock write.
- Don't want protected writes slipping out before the lock write, need a wmb()
  after the lock write.
- Don't want protected reads slipping out before the lock write, need a wrmb()
  after the lock write.

4.2: Unlocking
---------------------
- Don't want protected writes slipping out after the unlock, so we need a
  wmb() before the unlock write.
- Don't want protected reads slipping out after the unlock, so we need a
  rwmb() before the unlock write.  Arguably, there is some causality that
  makes this less meaningful (what did you do with the info?  If it wasn't a
  protected write, then who cares?).
- Don't care about reads or writes after the unlock happening earlier than the
  unlock write.
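
Putting 4.1 and 4.2 together, here is a minimal sketch of a test-and-set
spinlock.  'struct sketch_lock' and test_and_set() are made up for
illustration; on x86 the RMW's LOCK covers the CPU barriers on the lock side,
as discussed in section 2.2:

struct sketch_lock {
	int locked;
};

void sketch_spin_lock(struct sketch_lock *lock)
{
	while (test_and_set(&lock->locked))    /* atomic RMW: acts as a mb() */
		;
	cmb();    /* the RMW doesn't clobber "memory"; without the RMW's CPU
	           * barrier, 4.1 says we'd also need a wmb() and a wrmb() */
}

void sketch_spin_unlock(struct sketch_lock *lock)
{
	wmb();     /* keep protected writes from slipping past the unlock */
	rwmb();    /* keep protected reads from slipping past the unlock */
	lock->locked = 0;    /* the unlock write (both barriers include a cmb()) */
}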

5. Other Stuff
====================
Linux has a lot of work on memory barriers, far more advanced than our stuff.
Some of it doesn't make any sense.  I've asked our architects about things
like read_barrier_depends() and a few other things.  They also support some
non-Intel x86 clones that need a wmb_f() in place of a wmb() (they support
out-of-order writes).  If this pops up, we'll deal with it.

I chatted with Andrew a bit, and it turns out the following needs a barrier
on P2 under the Alpha's memory model:

(global) int x = 0, *p = 0;

P1:
x = 3;
FENCE
p = &x;

P2:
while (p == NULL) ;
assert(*p == 3);

As far as we can figure, you'd need some sort of 'value-speculating' hardware
to make this an issue in practice.  For now, we'll mark these spots in the
code if we see them, but I'm not overly concerned about it.

Also note that none of these barriers deal with things like page table walks,
page/segmentation update writes, non-temporal hints on writes, etc.  glhf with
that, future self!