memory_barriers.txt
Barret Rhoden
1. Overview
2. General Rules
3. Use in the Code Base
4. Memory Barriers and Locking
5. Other Stuff
1. Overview
====================
Memory barriers exist to make sure the compiler and the CPU do what we intend.
The compiler memory barrier (cmb()) (called an optimization barrier in linux)
prevents the compiler from reordering operations. However, CPUs can also
reorder reads and writes, in an architecture-dependent manner. In most places
with shared memory synchronization, you'll need some form of memory barriers.
These barriers apply to 'unrelated' reads and writes. The compiler and the CPU
cannot detect any relationship between them, so they believe it is safe to
reorder them. The problem arises in that we attach some meaning to them,
often in the form of signalling other cores.
CPU memory barriers only apply when talking to different cores or hardware
devices. They do not matter when dealing with your own core (perhaps between
a uthread and vcore context, running on the same core). cmb()s still matter,
even when the synchronizing code runs on the same core. See Section 3 for
more details.
2. General Rules
====================
2.1: Types of Memory Barriers
---------------------
For CPU memory barriers, we have 5 types.
- rmb() no reordering of reads with future reads
- wmb() no reordering of writes with future writes
- wrmb() no reordering of writes with future reads
- rwmb() no reordering of reads with future writes
- mb() no reordering of reads or writes with future reads or writes
All 5 of these have a cmb() built in. (a "memory" clobber).
Part of the reason for the distinction between wrmb/rwmb and the full mb is
that on many machines (x86), rwmb() is a noop (for us).
These barriers are used on 'normal' reads and writes, and they do not include
streaming/SSE instructions and string ops (on x86), and they do not include
talking with hardware devices. For memory barriers for those types of
operations, use the _f variety (force), e.g. rmb() -> rmb_f().
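As a rough sketch, here is how these barriers commonly look on x86 with
GCC-style inline assembly. These definitions are illustrative, not the code
base's actual ones, but they show why rwmb() is a noop for us on x86 while
wrmb() is not:

    /* illustrative x86 sketch; see the arch headers for the real thing */
    #define cmb()    asm volatile("" ::: "memory")       /* compiler barrier */
    #define mb()     asm volatile("mfence" ::: "memory") /* full CPU barrier */
    #define rmb()    cmb()  /* x86 won't reorder reads with future reads */
    #define wmb()    cmb()  /* x86 won't reorder writes with future writes */
    #define rwmb()   cmb()  /* x86 won't reorder reads with future writes */
    #define wrmb()   mb()   /* x86 CAN reorder a write with a future read */
    /* _f (force) variants, which also fence streaming ops and string ops */
    #define rmb_f()  asm volatile("lfence" ::: "memory")
    #define wmb_f()  asm volatile("sfence" ::: "memory")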
2.2: Atomics
---------------------
Most atomic operations, such as atomic_inc(), provide some form of memory
barrier protection. Specifically, all read-modify-write (RMW) atomic ops act
as a CPU memory barrier (like an mb()), but do *not* provide a cmb(). They
only provide a cmb() on the variables they apply to (i.e., variables in the
clobber list).
I considered making all atomics clobber "memory" (like the cmb()), but
sync_fetch_and_add() and friends do not do this by default, and it would also
mean that every use of atomics provides a cmb(), even where we don't require
one.
Also note that not all atomic operations are RMW. atomic_set(), _init(), and
_read() do not enforce a memory barrier in the CPU. If in doubt, look for the
LOCK prefix in the assembly (except for xchg, which locks implicitly, with no
prefix needed). We're taking advantage of the LOCK the atomics provide to
serialize and synchronize our memory.
In a lot of places, even if we have an atomic I'll still put in the expected
mb (e.g., a rmb()), especially if it clarifies the code. When I rely on the
atomic's LOCK, I'll make a note of it (such as in spinlock code).
Finally, note that the atomic RMWs handle the mb_f()s just like they handle
the regular memory barriers. The LOCK prefix does quite a bit.
These rules are a bit x86 specific, so for now other architectures will need
to implement their atomics such that they provide the same effects.
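To make the distinction concrete, here is a sketch of one RMW and one non-RMW
atomic. The type and bodies are illustrative, not the code base's actual
definitions:

    typedef struct { volatile long val; } atomic_t;  /* illustrative */

    /* RMW: compiles to a LOCK'd instruction, so it acts as a CPU mb(),
     * though per the above it only clobbers the variable it touches. */
    static inline void atomic_inc(atomic_t *number)
    {
        __sync_fetch_and_add(&number->val, 1);
    }

    /* Not RMW: a plain store - no LOCK, no CPU memory barrier at all. */
    static inline void atomic_set(atomic_t *number, long val)
    {
        number->val = val;
    }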
2.3: Locking
---------------------
If you access shared memory variables only when inside locks, then you do not
need to worry about memory barriers. The details are sorted out in the
locking code.
3. Use in the Code Base
====================
Figuring out how / when / if to use memory barriers is relatively easy.
- First, notice when you are using shared memory to synchronize. Any time
you are using a global variable or working on structs that someone else
might look at concurrently, and you aren't using locks or another vetted
tool (like krefs), then you need to worry about this.
- Second, determine what reads and writes you are doing.
- Third, determine who you are talking to.
If you're talking to other cores or devices, you need CPU mbs. If not, a cmb
suffices. Based on the types of reads and writes you are doing, just pick one
of the 5 memory barriers.
3.1: What's a Read or Write?
---------------------
When writing code that synchronizes with other threads via shared memory, we
have a variety of patterns. Most infamous is the "signal, then check if the
receiver is still listening", which is the critical part of the "check,
signal, check again" pattern. For examples, look at things like
'notif_pending' and when we check VC_CAN_RCV_MSG in event.c.
In these examples, "write" and "read" include things such as posting events or
checking flags (which ultimately involve writes and reads). You need to be
aware that some functions, such as TAILQ_REMOVE, are writes, even though they
are not written as *x = 3;. Whatever your code is, you should be able to point
out what are the critical variables and their interleavings. You should see
how a CPU reordering would break your algorithm just like a poorly timed
interrupt or other concurrent interleaving.
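As a sketch, the signalling half of that pattern looks something like this
(hypothetical names, loosely modeled on the event code):

    /* sender: signal, then check if the receiver is still listening */
    vcpd->notif_pending = TRUE;           /* the signal (a write) */
    wrmb();                               /* flag read must not pass the write */
    if (!(vcpd->flags & VC_CAN_RCV_MSG))
        handle_event_ourselves();         /* hypothetical fallback path */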
When looking at a function that we consider a signal/write, you need to know
what it does. It might handle its memory barriers internally (protecting its
own memory operations). However, it might not. In general, I err on the side
of extra mbs, or at the very least leave a comment about what I am doing and
what sort of barriers I desire from the code.
3.2: Who Are We Talking To?
---------------------
CPU memory barriers are necessary when synchronizing/talking with remote cores
or devices, but not when "talking" with your own core. For instance, if you
issue two writes, then read them, you will see both writes (reads may not be
reordered with older writes to the same location on a single processor, and
your reads get served out of the write buffer). Note that the read can
pass/happen before the write, but the CPU gives you the correct value that the
write gave you (store-buffer forwarding). Other cores may not see both of
them due to reordering magic. Put another way, since those remote processors
didn't do the initial writing, the rule about reading the same location
doesn't apply to them.
Given this, when finding spots in the code that may require a mb(), I think
about signalling a concurrent viewer on a different core. A classic example
is when we signal to say "process an item". Say we were on one core, filled
out the struct, and then signalled. If we then started reading from that same
core, we would see our old write (you see writes you previously wrote),
but someone else on another core may see the signal before the filled out
struct.
There is still a distinction between the compiler reordering code and the
processor reordering code. Imagine the case of "filling a struct, then
posting the pointer to the struct". If the compiler reorders, the pointer may
be posted before the struct is filled, and an interrupt may come in. The
interrupt handler may look at that structure and see invalid data. This has
nothing to do with the CPU reordering - simply the compiler messed with you.
Note this only matters if we care about a concurrent interleaving (interrupt
handler with a kthread for instance), and we ought to be aware that there is
some shared state being mucked with.
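A minimal sketch of that case, with hypothetical names; since the handler
runs on the same core, the cmb() is all we need:

    static struct item slot;
    static struct item *shared_ptr;  /* the interrupt handler reads this */

    slot.payload = 42;               /* fill out the struct */
    cmb();                           /* compiler: don't post the pointer early */
    shared_ptr = &slot;              /* post it; the handler may now look */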
For a more complicated example, consider DONT_MIGRATE and reading vcoreid.
Logically, we want to make sure the vcoreid read doesn't come before the flag
write (since uthread code now thinks it is on a particular vcore). This
write, then read would normally require a wrmb(), but is that overkill?
Clearly, we need the compiler to issue the writes in order, so we need a cmb()
at least. Here's the code that the processor will get:
    orl    $0x1,0x254(%edi)      (write DONT_MIGRATE)
    mov    $0xfffffff0,%ebp      (getting ready with the TLS)
    mov    %gs:0x0(%ebp),%esi    (reading the vcoreid from TLS)
Do we need a wrmb() here? Any remote code might see the write after the
vcoreid read, but the read is a bit different than normal ones. We aren't
reading global memory, and we aren't trying to synchronize with another core.
All that matters is that if the thread running this code saw the vcoreid read,
then whoever reads the flag sees the write.
The 'whoever' is not concurrently running code - it is 2LS code that either
runs on the vcore due to an IPI/notification, or it is 2LS code running
remotely that responded to a preemption of that vcore. Both of these cases
require an IPI. AFAIK, interrupts serialize memory operations, so whatever
writes were issued before the interrupt hit memory (or cache) before we even
do things like write out the trapframe of the thread. If this isn't true,
then the synchronization we do when writing out the trapframe (before allowing
a remote core to try and recover the preempted uthread), will handle the
DONT_MIGRATE write.
Anyway, the point is that remote code will look at it, but only when told to
look. That "telling" is the write, which happens after the
synchronizing/serializing events of the IPI path.
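In code, the conclusion of this example amounts to something like the
following sketch (the real uthread code has its own flag names and TLS
helpers):

    uthread->flags |= UTHREAD_DONT_MIGRATE;  /* the flag write */
    cmb();                 /* compiler: issue the write before the TLS read */
    vcoreid = vcore_id();  /* the vcoreid read; no wrmb() needed, as argued */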
4. Memory Barriers and Locking
====================
The implementation of locks requires memory barriers (both compiler and CPU).
Regular users of locks do not need to worry about this. Lock implementers do.
We need to consider reorderings of reads and writes from before and after the
lock/unlock write. In these next sections, the reads and writes I talk about
are from a given thread/CPU. Protected reads/writes are those that happen
while the lock is held. When I say you need a wmb(), you could get by with a
cmb() and an atomic-RMW op, so long as you have the cmb() and at least the
appropriate CPU mb().
4.1: Locking
---------------------
- Don't care about our reads or writes before the lock happening after the
lock write.
- Don't want protected writes slipping out before the lock write, need a wmb()
after the lock write.
- Don't want protected reads slipping out before the lock write, need a wrmb()
after the lock write.
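Tying this to section 2.2, here is a sketch of an acquire. The LOCK'd RMW
supplies the CPU mb(), covering both rules above, but we still need an
explicit cmb(), since atomics do not clobber "memory". The type and helpers
are illustrative, not the code base's actual lock:

    typedef struct { volatile long locked; } spinlock_t;  /* illustrative */

    static inline void spin_lock(spinlock_t *lock)
    {
        /* the swap is a LOCK'd RMW (xchg), so it acts as a CPU mb() */
        while (atomic_swap(&lock->locked, 1))
            cpu_relax();
        /* atomics don't provide a cmb(); this keeps the protected reads
         * and writes after the lock write */
        cmb();
    }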
4.2: Unlocking
---------------------
- Don't want protected writes slipping out after the unlock, so we need a
wmb() before the unlock write.
- Don't want protected reads slipping out after the unlock, so we need a
rwmb() before the unlock write. Arguably, there is some causality that
makes this less meaningful (what did you do with the info? If it wasn't a
protected write, then who cares?).
- Don't care about reads or writes after the unlock happening earlier than the
unlock write.
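And the matching release, continuing the sketch above. On x86, both barriers
reduce to the cmb() they carry, so the unlock is just a compiler barrier plus
a plain store:

    static inline void spin_unlock(spinlock_t *lock)
    {
        wmb();   /* no protected writes slip past the unlock write */
        rwmb();  /* no protected reads slip past the unlock write */
        lock->locked = 0;  /* the unlock write: a plain store, no LOCK */
    }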
5. Other Stuff
====================
Linux has a lot of work on memory barriers, far more advanced than our stuff.
Some of it doesn't make any sense. I've asked our architects about things
like read_barrier_depends() and a few other things. They also support some
non-Intel x86 clones that need a wmb_f() in place of a wmb(), since those
chips allow out-of-order writes. If this pops up, we'll deal with it.
I chatted with Andrew a bit, and it turns out the following needs a barrier
on P2 under the Alpha's memory model:
    (global) int x = 0, *p = 0;

    P1:
        x = 3;
        FENCE
        p = &x;

    P2:
        while (p == NULL) ;
        assert(*p == 3);
The barrier would go between P2's read of p and the dereference of *p, which
is what read_barrier_depends() is for. As far as we can figure, you'd need
some sort of 'value-speculating' hardware to make this an issue in practice.
For now, we'll mark these spots in the code if we see them, but I'm not overly
concerned about it.
Also note that none of these barriers deal with things like page table walks,
page/segmentation update writes, non-temporal hints on writes, etc. glhf with
that, future self!