| memory_barriers.txt |
| Barret Rhoden |
| |
| 1. Overview |
| 2. General Rules |
| 3. Use in the Code Base |
| 4. Memory Barriers and Locking |
| 5. Other Stuff |
| |
| 1. Overview |
| ==================== |
| Memory barriers exist to make sure the compiler and the CPU do what we intend. |
| The compiler memory barrier (cmb()) (called an optimization barrier in Linux) |
| prevents the compiler from reordering operations. However, CPUs can also |
| reorder reads and writes, in an architecture-dependent manner. In most places |
| with shared memory synchronization, you'll need some form of memory barriers. |
| |
| These barriers apply to 'unrelated' reads and writes. The compiler or the CPU |
| cannot detect any relationship between them, so it believes it is safe to |
| reorder them. The problem arises in that we attach some meaning to them, |
| often in the form of signalling other cores. |
| |
| CPU memory barriers only apply when talking to different cores or hardware |
| devices. They do not matter when dealing with your own core (perhaps between |
| a uthread and vcore context, running on the same core). cmb()s still matter, |
| even when the synchronizing code runs on the same core. See Section 3 for |
| more details. |
| |
| 2. General Rules |
| ==================== |
| 2.1: Types of Memory Barriers |
| --------------------- |
| For CPU memory barriers, we have 5 types. |
| - rmb():  no reordering of reads with future reads |
| - wmb():  no reordering of writes with future writes |
| - wrmb(): no reordering of writes with future reads |
| - rwmb(): no reordering of reads with future writes |
| - mb():   no reordering of reads or writes with future reads or writes |
| |
| All 5 of these have a cmb() built in (a "memory" clobber). |
| |
| Part of the reason for the distinction between wrmb/rwmb and the full mb is |
| that on many machines (x86), rwmb() is a noop (for us). |
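| |
| As a rough illustration of the list above, here is one way the five barriers |
| could be expressed portably with C11 fences. This is a sketch, not the |
| definitions used in the code base: the mapping is conservative (an acquire |
| fence is stronger than rmb() strictly needs), the cmb() body assumes a GCC- |
| or Clang-style compiler, and store_then_load() is a hypothetical helper. |
| |
```c
#include <assert.h>
#include <stdatomic.h>

/* Compiler-only barrier: the "memory" clobber stops the compiler from moving
 * memory accesses across this point.  It emits no instruction. */
static inline void cmb(void) { __asm__ __volatile__("" ::: "memory"); }

/* Conservative C11 mappings (assumption: not the code base's definitions).
 * GCC and Clang also treat each of these fences as a compiler barrier. */
static inline void rmb(void)  { atomic_thread_fence(memory_order_acquire); }
static inline void wmb(void)  { atomic_thread_fence(memory_order_release); }
static inline void rwmb(void) { atomic_thread_fence(memory_order_acq_rel); }
static inline void wrmb(void) { atomic_thread_fence(memory_order_seq_cst); }
static inline void mb(void)   { atomic_thread_fence(memory_order_seq_cst); }

/* Example use: a write followed by a read of a different location. */
int store_then_load(atomic_int *flag, atomic_int *other)
{
    atomic_store_explicit(flag, 1, memory_order_relaxed);
    wrmb();   /* write, then read: the case x86 may reorder for normal memory */
    return atomic_load_explicit(other, memory_order_relaxed);
}
```
| |
| On x86, compilers typically emit an instruction only for the seq_cst fences |
| (wrmb() and mb() here); the others reduce to a compiler barrier. |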
| |
| These barriers are used on 'normal' reads and writes, and they do not include |
| streaming/SSE instructions and string ops (on x86), and they do not include |
| talking with hardware devices. For memory barriers for those types of |
| operations, use the _f variety (force), e.g. rmb() -> rmb_f(). |
| |
| 2.2: Atomics |
| --------------------- |
| Most atomic operations, such as atomic_inc(), provide some form of memory |
| barrier protection. Specifically, all read-modify-write (RMW) atomic ops act |
| as a CPU memory barrier (like an mb()), but do *not* provide a full cmb(). |
| They only act as a cmb() for the variables they apply to (i.e., variables in |
| the clobber list). |
| |
| I considered making all atomics clobber "memory" (like the cmb()), but |
| sync_fetch_and_add() and friends do not do this by default, and it also means |
| that any use of atomics (even when we don't require a cmb()) then provides a |
| cmb(). |
| |
| Also note that not all atomic operations are RMW. atomic_set(), _init(), and |
| _read() do not enforce a memory barrier in the CPU. If in doubt, look for the |
| LOCK prefix in the assembly (except for xchg, which locks implicitly). We're |
| taking advantage of the LOCK the atomics provide to serialize and synchronize |
| our memory. |
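| |
| The RMW versus non-RMW distinction can be sketched with GCC's |
| __sync_fetch_and_add(). This assumes a GCC- or Clang-style compiler, and the |
| struct and function names are hypothetical. GCC documents the __sync builtins |
| as full barriers, so the explicit cmb() is redundant there; it is shown |
| because the rule in the text applies to hand-written atomics that clobber |
| only their own operand. |
| |
```c
#include <assert.h>

static inline void cmb(void) { __asm__ __volatile__("" ::: "memory"); }

/* Hypothetical work item: 'payload' is plain memory, 'posted' is a counter
 * that other cores watch. */
struct work {
    int payload;
    int posted;
};

/* RMW atomic: on x86 this becomes a LOCK-prefixed instruction, which orders
 * the payload store before the counter update as far as the CPU goes. */
int post_work(struct work *w, int val)
{
    w->payload = val;
    cmb();    /* keep the compiler from sinking the payload store */
    return __sync_fetch_and_add(&w->posted, 1);    /* returns the old count */
}

/* Not an RMW: a plain atomic store, comparable to an atomic_set()-style
 * helper.  No LOCK, so no CPU barrier; add an explicit wmb() before it if
 * ordering against earlier writes matters. */
void reset_work(struct work *w)
{
    __atomic_store_n(&w->posted, 0, __ATOMIC_RELAXED);
}
```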
| |
| In a lot of places, even if we have an atomic I'll still put in the expected |
| mb (e.g., an rmb()), especially if it clarifies the code. When I rely on the |
| atomic's LOCK, I'll make a note of it (such as in spinlock code). |
| |
| Finally, note that the atomic RMWs handle the mb_f()s just like they handle |
| the regular memory barriers. The LOCK prefix does quite a bit. |
| |
| These rules are a bit x86 specific, so for now other architectures will need |
| to implement their atomics such that they provide the same effects. |
| |
| 2.3: Locking |
| --------------------- |
| If you access shared memory variables only when inside locks, then you do not |
| need to worry about memory barriers. The details are sorted out in the |
| locking code. |
| |
| 3. Use in the Code Base |
| ==================== |
| Figuring out how / when / if to use memory barriers is relatively easy. |
| - First, notice when you are using shared memory to synchronize. Any time |
| you are using a global variable or working on structs that someone else |
| might look at concurrently, and you aren't using locks or another vetted |
| tool (like krefs), then you need to worry about this. |
| - Second, determine what reads and writes you are doing. |
| - Third, determine who you are talking to. |
| |
| If you're talking to other cores or devices, you need CPU mbs. If not, a cmb |
| suffices. Based on the types of reads and writes you are doing, just pick one |
| of the 5 memory barriers. |
| |
| 3.1: What's a Read or Write? |
| --------------------- |
| When writing code that synchronizes with other threads via shared memory, we |
| have a variety of patterns. Most infamous is the "signal, then check if the |
| receiver is still listening", which is the critical part of the "check, |
| signal, check again" pattern. For examples, look at things like |
| 'notif_pending' and when we check VC_CAN_RCV_MSG in event.c. |
| |
| In these examples, "write" and "read" include things such as posting events or |
| checking flags (which ultimately involve writes and reads). You need to be |
| aware that some functions, such as TAILQ_REMOVE, are writes even though they |
| are not written as *x = 3;. Whatever your code is, you should be able to point |
| out what are the critical variables and their interleavings. You should see |
| how a CPU reordering would break your algorithm just like a poorly timed |
| interrupt or other concurrent interleaving. |
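| |
| A minimal sketch of the "signal, then check if the receiver is still |
| listening" step, using C11 atomics. The struct and its fields are |
| hypothetical stand-ins for things like notif_pending and VC_CAN_RCV_MSG, not |
| the code in event.c. The write-then-read ordering is the case that calls for |
| a wrmb(), expressed here as a seq_cst fence. |
| |
```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical receiver state shared between cores. */
struct rcvr {
    atomic_bool notif_pending;    /* sender writes, receiver reads */
    atomic_bool can_rcv;          /* receiver writes, sender reads */
};

/* Sender: post the signal, then check whether anyone is still listening.
 * Returns true if the receiver was still listening, false if the caller must
 * find another receiver. */
bool signal_then_check(struct rcvr *r)
{
    atomic_store_explicit(&r->notif_pending, true, memory_order_relaxed);
    /* wrmb(): the flag read below must not be performed before the write
     * above is visible. */
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&r->can_rcv, memory_order_relaxed);
}

/* Receiver: stop listening, then check for a signal that raced with us.
 * Returns true if a notification is pending. */
bool stop_listening_then_check(struct rcvr *r)
{
    atomic_store_explicit(&r->can_rcv, false, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);    /* same write-then-read */
    return atomic_load_explicit(&r->notif_pending, memory_order_relaxed);
}
```
| |
| With both fences in place, at least one side sees the other's write: either |
| the sender sees can_rcv == false, or the receiver sees notif_pending == true. |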
| |
| When looking at a function that we consider a signal/write, you need to know |
| what it does. It might handle its memory barriers internally (protecting its |
| own memory operations). However, it might not. In general, I err on the side |
| of extra mbs, or at the very least leave a comment about what I am doing and |
| what sort of barriers I desire from the code. |
| |
| 3.2: Who Are We Talking To? |
| --------------------- |
| CPU memory barriers are necessary when synchronizing/talking with remote cores |
| or devices, but not when "talking" with your own core. For instance, if you |
| issue two writes, then read them, you will see both writes (reads may not be |
| reordered with older writes to the same location on a single processor, and |
| your reads get served out of the write buffer). Note, the read can |
| pass/happen before the write, but the CPU still gives you the value that the |
| write stored (store-buffer forwarding). Other cores may not see both of |
| them due to reordering magic. Put another way, since those remote processors |
| didn't do the initial writing, the rule about reading the same location |
| doesn't apply to them. |
| |
| Given this, when finding spots in the code that may require an mb(), I think |
| about signalling a concurrent viewer on a different core. A classic example |
| is when we signal to say "process an item". Say we were on one core, filled |
| the struct out, and then signalled. If we then started reading from that |
| same core, we would see our own writes (you see writes you previously wrote), |
| but someone else on another core may see the signal before the filled out |
| struct. |
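| |
| The "fill the struct, then signal" case can be sketched as follows, assuming |
| C11 atomics stand in for wmb() and rmb(). The struct and function names are |
| hypothetical. |
| |
```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical work item handed from one core to another. */
struct item {
    int a;
    int b;
    atomic_bool ready;    /* the "process an item" signal */
};

/* Producer core: fill out the struct, then signal. */
void post_item(struct item *it, int a, int b)
{
    it->a = a;
    it->b = b;
    /* wmb(): the struct writes must be visible before the signal write */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&it->ready, true, memory_order_relaxed);
}

/* Consumer core: check the signal, then read the struct.  Returns false and
 * leaves *sum alone if nothing has been posted yet. */
bool take_item(struct item *it, int *sum)
{
    if (!atomic_load_explicit(&it->ready, memory_order_relaxed))
        return false;
    /* rmb(): the struct reads must not be satisfied before the signal read */
    atomic_thread_fence(memory_order_acquire);
    *sum = it->a + it->b;
    return true;
}
```
| |
| Without the producer's barrier, a remote consumer could see ready == true and |
| still read stale values of a and b. |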
| |
| There is still a distinction between the compiler reordering code and the |
| processor reordering code. Imagine the case of "filling a struct, then |
| posting the pointer to the struct". If the compiler reorders, the pointer may |
| be posted before the struct is filled, and an interrupt may come in. The |
| interrupt handler may look at that structure and see invalid data. This has |
| nothing to do with the CPU reordering - simply the compiler messed with you. |
| Note this only matters if we care about a concurrent interleaving (interrupt |
| handler with a kthread for instance), and we ought to be aware that there is |
| some shared state being mucked with. |
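| |
| A sketch of the same-core case, where only the compiler needs restraining. |
| The names are hypothetical, the cmb() body assumes a GCC- or Clang-style |
| compiler, and irq_peek_data() stands in for an interrupt handler running on |
| the core that posted the message. |
| |
```c
#include <assert.h>

static inline void cmb(void) { __asm__ __volatile__("" ::: "memory"); }

struct msg {
    int len;
    int data;
};

/* Hypothetical slot that an interrupt handler on this core looks at.  The
 * pointer is volatile so the compiler performs every access to it. */
static struct msg *volatile posted_msg;

/* Thread context: fill the struct, then post the pointer. */
void post_msg(struct msg *m, int data)
{
    m->len = (int)sizeof(m->data);
    m->data = data;
    cmb();    /* the compiler must emit the fills before the pointer store */
    posted_msg = m;
}

/* Interrupt context on the same core.  No CPU barrier is needed, since a core
 * sees its own earlier writes.  Returns -1 if nothing is posted. */
int irq_peek_data(void)
{
    struct msg *m = posted_msg;

    return m ? m->data : -1;
}
```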
| |
| For a more complicated example, consider DONT_MIGRATE and reading vcoreid. |
| Logically, we want to make sure the vcoreid read doesn't come before the flag |
| write (since uthread code now thinks it is on a particular vcore). This |
| write, then read would normally require a wrmb(), but is that overkill? |
| Clearly, we need the compiler to issue the write and the read in order, so we |
| need a cmb() at least. Here's the code that the processor will get: |
|     orl  $0x1,0x254(%edi)      (write DONT_MIGRATE) |
|     mov  $0xfffffff0,%ebp      (getting ready with the TLS) |
|     mov  %gs:0x0(%ebp),%esi    (reading the vcoreid from TLS) |
| |
| Do we need a wrmb() here? Any remote code might see the write after the |
| vcoreid read, but the read is a bit different than normal ones. We aren't |
| reading global memory, and we aren't trying to synchronize with another core. |
| All that matters is that if the thread running this code saw the vcoreid read, |
| then whoever reads the flag sees the write. |
| |
| The 'whoever' is not concurrently running code - it is 2LS code that either |
| runs on the vcore due to an IPI/notification, or it is 2LS code running |
| remotely that responded to a preemption of that vcore. Both of these cases |
| require an IPI. AFAIK, interrupts serialize memory operations, so whatever |
| writes were issued before the interrupt will hit memory (or cache) before we |
| even do things like write out the trapframe of the thread. If this isn't true, |
| then the synchronization we do when writing out the trapframe (before allowing |
| a remote core to try and recover the preempted uthread), will handle the |
| DONT_MIGRATE write. |
| |
| Anyway, the point is that remote code will look at it, but only when told to |
| look. That "telling" is the write, which happens after the |
| synchronizing/serializing events of the IPI path. |
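| |
| In C, the DONT_MIGRATE sequence discussed above might look like the sketch |
| below. The struct, the TLS variable, and the function name are hypothetical |
| stand-ins, and the cmb() body assumes a GCC- or Clang-style compiler. |
| |
```c
#include <assert.h>
#include <stdint.h>

static inline void cmb(void) { __asm__ __volatile__("" ::: "memory"); }

#define DONT_MIGRATE 0x1    /* flag bit, as in the orl $0x1 above */

/* Hypothetical stand-ins: a uthread with a flags word, and a thread-local
 * variable holding the id of the vcore we are running on. */
struct uthread {
    uint32_t flags;
};
static _Thread_local uint32_t tls_vcoreid;

/* Pin the uthread, then find out which vcore it is pinned to. */
uint32_t pin_and_get_vcoreid(struct uthread *uth)
{
    uth->flags |= DONT_MIGRATE;
    cmb();    /* the compiler keeps the flag write ahead of the TLS read */
    /* No wrmb(): remote readers of the flag only look after an IPI path. */
    return tls_vcoreid;
}
```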
| |
| 4. Memory Barriers and Locking |
| ==================== |
| The implementation of locks requires memory barriers (both compiler and CPU). |
| Regular users of locks do not need to worry about this. Lock implementers do. |
| |
| We need to consider reorderings of reads and writes from before and after the |
| lock/unlock write. In these next sections, the reads and writes I talk about |
| are from a given thread/CPU. Protected reads/writes are those that happen |
| while the lock is held. When I say you need a wmb(), you could get by with a |
| cmb() and an atomic-RMW op: just so long as you have the cmb() and the |
| appropriate CPU mb() at a minimum. |
| |
| 4.1: Locking |
| --------------------- |
| - Don't care about our reads or writes before the lock happening after the |
| lock write. |
| - Don't want protected writes slipping out before the lock write, need a wmb() |
| after the lock write. |
| - Don't want protected reads slipping out before the lock write, need a wrmb() |
| after the lock write. |
| |
| 4.2: Unlocking |
| --------------------- |
| - Don't want protected writes slipping out after the unlock, so we need a |
| wmb() before the unlock write. |
| - Don't want protected reads slipping out after the unlock, so we need an |
| rwmb() before the unlock write. Arguably, there is some causality that |
| makes this less meaningful (what did you do with the info? if not a write |
| that was protected, then who cares?) |
| - Don't care about reads or writes after the unlock happening earlier than the |
| unlock write. |
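| |
| The rules in 4.1 and 4.2 can be sketched with a test-and-set spinlock built |
| on C11's atomic_flag. This is a hypothetical lock, not the code base's |
| implementation: acquire ordering on the lock RMW stands in for the barriers |
| after the lock write, and release ordering on the clear stands in for the |
| barriers before the unlock write. |
| |
```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical test-and-set spinlock. */
struct spinlock {
    atomic_flag locked;
};
#define SPINLOCK_INIT { ATOMIC_FLAG_INIT }

/* Try once.  The acquire ordering on the RMW plays the role of the wmb() and
 * wrmb() after the lock write: protected accesses cannot move above it.
 * Earlier accesses may still sink into the critical section (don't care). */
bool spin_trylock(struct spinlock *l)
{
    return !atomic_flag_test_and_set_explicit(&l->locked,
                                              memory_order_acquire);
}

void spin_lock(struct spinlock *l)
{
    while (!spin_trylock(l))
        ;    /* spin; a real lock would relax the CPU here */
}

/* The release ordering plays the role of the wmb() and rwmb() before the
 * unlock write: protected accesses cannot move below it.  Later accesses may
 * still rise into the critical section (don't care). */
void spin_unlock(struct spinlock *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```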
| |
| 5. Other Stuff |
| ==================== |
| Linux has a lot of work on memory barriers, far more advanced than our stuff. |
| Some of it doesn't make any sense. I've asked our architects about things |
| like read_barrier_depends() and a few other things. Linux also supports some |
| non-Intel x86 clones that need wmb_f() in place of a wmb() (they support out |
| of order writes). If this pops up, we'll deal with it. |
| |
| I chatted with Andrew a bit, and it turns out the following needs a barrier |
| on P2 under the Alpha's memory model: |
| |
|     (global) int x = 0, *p = 0; |
| |
|     P1: |
|         x = 3; |
|         FENCE |
|         p = &x; |
| |
|     P2: |
|         while (p == NULL) ; |
|         assert(*p == 3); |
| |
| As far as we can figure, you'd need some sort of 'value-speculating' hardware |
| to make this an issue in practice. For now, we'll mark these spots in the code |
| if we see them, but I'm not overly concerned about it. |
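| |
| For reference, the P1/P2 example can be written with C11 atomics so that it |
| is correct even under an Alpha-like model. This is a sketch with hypothetical |
| function names; the acquire load on P2 supplies the extra barrier and is |
| stronger than most hardware needs. |
| |
```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

static int x;               /* plain data */
static _Atomic(int *) p;    /* published pointer, initially NULL */

/* P1: write the data, FENCE, then publish the pointer. */
void p1_publish(void)
{
    x = 3;
    atomic_thread_fence(memory_order_release);    /* the FENCE */
    atomic_store_explicit(&p, &x, memory_order_relaxed);
}

/* P2: wait for the pointer, then dereference it.  The acquire load is the
 * barrier Alpha needs between reading p and reading *p; on most other
 * machines the address dependency is enough for the hardware. */
int p2_consume(void)
{
    int *q;

    while ((q = atomic_load_explicit(&p, memory_order_acquire)) == NULL)
        ;    /* spin until P1 publishes */
    return *q;
}
```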
| |
| Also note that none of these barriers deal with things like page table walks, |
| page/segmentation update writes, non-temporal hints on writes, etc. glhf with |
| that, future self! |