memory_barriers.txt
Barret Rhoden

1. Overview
2. General Rules
3. Use in the Code Base
4. Memory Barriers and Locking
5. Other Stuff

1. Overview
====================
Memory barriers exist to make sure the compiler and the CPU do what we intend.
The compiler memory barrier (cmb()) (called an optimization barrier in Linux)
prevents the compiler from reordering operations.  However, CPUs can also
reorder reads and writes, in an architecture-dependent manner.  In most places
with shared memory synchronization, you'll need some form of memory barriers.

These barriers apply to 'unrelated' reads and writes.  The compiler or the CPU
cannot detect any relationship between them, so it believes it is safe to
reorder them.  The problem arises in that we attach some meaning to them,
often in the form of signalling other cores.

CPU memory barriers only apply when talking to different cores or hardware
devices.  They do not matter when dealing with your own core (perhaps between
a uthread and vcore context, running on the same core).  cmb()s still matter,
even when the synchronizing code runs on the same core.  See Section 3 for
more details.

2. General Rules
====================
2.1: Types of Memory Barriers
---------------------
For CPU memory barriers, we have 5 types.
- rmb() no reordering of reads with future reads
- wmb() no reordering of writes with future writes
- wrmb() no reordering of writes with future reads
- rwmb() no reordering of reads with future writes
- mb() no reordering of reads or writes with future reads or writes

All 5 of these have a cmb() built in (a "memory" clobber).

Part of the reason for the distinction between wrmb/rwmb and the full mb is
that on many machines (x86), rwmb() is a noop (for us).

These barriers apply to 'normal' reads and writes; they do not cover
streaming/SSE instructions or string ops (on x86), and they do not cover
talking with hardware devices.  For memory barriers around those types of
operations, use the _f variety (force), e.g. rmb() -> rmb_f().
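
As a quick example of picking a barrier, consider a flag-and-data handoff
between two cores.  The variables below are made up for illustration:

(global) int data = 0, ready = 0;

Core 0 (producer):
data = 42;
wmb();        /* don't let the ready write pass the data write */
ready = 1;

Core 1 (consumer):
while (ready == 0) ;
rmb();        /* don't let the data read pass the ready read */
assert(data == 42);

Without the wmb(), core 1 could see ready == 1 while data is still stale;
without the rmb(), core 1 could read data before it read ready.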

2.2: Atomics
---------------------
Most atomic operations, such as atomic_inc(), provide some form of memory
barrier protection.  Specifically, all read-modify-write (RMW) atomic ops act
as a CPU memory barrier (like an mb()), but do *not* provide a cmb().  They
only provide a cmb() on the variables they apply to (i.e., variables in the
clobber list).
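
For instance (a sketch; 'item' and 'count' are made-up variables), the RMW
orders the CPU, but the compiler is still free to move unrelated accesses
across it unless you add a cmb():

item->ready = TRUE;     /* unrelated to 'count', as far as the compiler knows */
cmb();                  /* needed: the atomic only clobbers 'count' */
atomic_inc(&count);     /* LOCKed RMW, acts as a CPU mb() */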

I considered making all atomics clobber "memory" (like the cmb()), but
__sync_fetch_and_add() and friends do not do this by default, and it would
also mean that any use of atomics (even when we don't require a cmb()) then
provides a cmb().

Also note that not all atomic operations are RMW.  atomic_set(), _init(), and
_read() do not enforce a memory barrier in the CPU.  If in doubt, look for the
LOCK in the assembly (except for xchg, which is implicitly locked even without
the prefix).  We're taking advantage of the LOCK the atomics provide to
serialize and synchronize our memory.

In a lot of places, even if we have an atomic I'll still put in the expected
mb (e.g., a rmb()), especially if it clarifies the code.  When I rely on the
atomic's LOCK, I'll make a note of it (such as in spinlock code).

Finally, note that the atomic RMWs handle the mb_f()s just like they handle
the regular memory barriers.  The LOCK prefix does quite a bit.

These rules are a bit x86 specific, so for now other architectures will need
to implement their atomics such that they provide the same effects.

2.3: Locking
---------------------
If you access shared memory variables only when inside locks, then you do not
need to worry about memory barriers.  The details are sorted out in the
locking code.
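
For example (assuming spin_lock()/spin_unlock() as the lock API; 'count_lock'
and 'count' are made up), the lock implementation supplies the barriers:

spin_lock(&count_lock);
count++;                  /* ordering is handled by the locking code */
spin_unlock(&count_lock);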

3. Use in the Code Base
====================
Figuring out how / when / if to use memory barriers is relatively easy.
- First, notice when you are using shared memory to synchronize.  Any time you
  are using a global variable or working on structs that someone else might
  look at concurrently, and you aren't using locks or another vetted tool
  (like krefs), you need to worry about this.
- Second, determine what reads and writes you are doing.
- Third, determine who you are talking to.

If you're talking to other cores or devices, you need CPU mbs.  If not, a
cmb() suffices.  Based on the types of reads and writes you are doing, just
pick one of the 5 memory barriers.

3.1: What's a Read or Write?
---------------------
When writing code that synchronizes with other threads via shared memory, we
have a variety of patterns.  Most infamous is the "signal, then check if the
receiver is still listening", which is the critical part of the "check,
signal, check again" pattern.  For examples, look at things like
'notif_pending' and when we check VC_CAN_RCV_MSG in event.c.
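
Heavily simplified, the shape of that pattern is a write followed by a read of
a different location, which is exactly what wrmb() orders.  This is not the
actual event.c code; 'vcpd', 'listening', and send_fallback_event() are
stand-ins:

vcpd->notif_pending = TRUE;     /* the signal: a write */
wrmb();                         /* don't let the check pass the signal */
if (!listening)                 /* the check: a read of another location */
	send_fallback_event();  /* hypothetical recovery path */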

In these examples, "write" and "read" include things such as posting events or
checking flags (which ultimately involve writes and reads).  You need to be
aware that some functions, such as TAILQ_REMOVE, are writes, even though they
are not written as *x = 3;.  Whatever your code is, you should be able to
point out the critical variables and their interleavings.  You should see how
a CPU reordering would break your algorithm just like a poorly timed interrupt
or other concurrent interleaving.

When looking at a function that we consider a signal/write, you need to know
what it does.  It might handle its memory barriers internally (protecting its
own memory operations).  However, it might not.  In general, I err on the side
of extra mbs, or at the very least leave a comment about what I am doing and
what sort of barriers I desire from the code.

3.2: Who Are We Talking To?
---------------------
CPU memory barriers are necessary when synchronizing/talking with remote cores
or devices, but not when "talking" with your own core.  For instance, if you
issue two writes, then read them, you will see both writes (reads may not be
reordered with older writes to the same location on a single processor, and
your reads get served out of the write buffer).  Note that the read can
pass/happen before the write, but the CPU gives you the correct value that the
write gave you (store-buffer forwarding).  Other cores may not see both of
them due to reordering magic.  Put another way, since those remote processors
didn't do the initial writing, the rule about reading the same location
doesn't apply to them.

Given this, when finding spots in the code that may require a mb(), I think
about signalling a concurrent viewer on a different core.  A classic example
is when we signal to say "process an item".  Say we were on one core, filled
out the struct, and then signalled; if we then started reading from that same
core, we would see our own writes (you see writes you previously wrote), but
someone else on another core may see the signal before the filled-out struct.

There is still a distinction between the compiler reordering code and the
processor reordering code.  Imagine the case of "filling a struct, then
posting the pointer to the struct".  If the compiler reorders, the pointer may
be posted before the struct is filled, and an interrupt may come in.  The
interrupt handler may look at that structure and see invalid data.  This has
nothing to do with CPU reordering; the compiler simply messed with you.  Note
that this only matters if we care about a concurrent interleaving (an
interrupt handler with a kthread, for instance), and we ought to be aware that
there is some shared state being mucked with.
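
A minimal sketch of that case ('work' and 'work_ptr' are made up; the handler
that reads work_ptr runs on this same core, so a cmb() is all we need):

work.x = 1;
work.y = 2;          /* fill out the struct */
cmb();               /* keep the compiler from posting the pointer early */
work_ptr = &work;    /* an interrupt handler may read this at any time */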

For a more complicated example, consider DONT_MIGRATE and reading vcoreid.
Logically, we want to make sure the vcoreid read doesn't come before the flag
write (since uthread code now thinks it is on a particular vcore).  This
write-then-read would normally require a wrmb(), but is that overkill?
Clearly, we need the compiler to keep these in order, so we need a cmb() at
least.  Here's the code that the processor will get:
orl    $0x1,0x254(%edi)      (write DONT_MIGRATE)
mov    $0xfffffff0,%ebp      (getting ready with the TLS)
mov    %gs:0x0(%ebp),%esi    (reading the vcoreid from TLS)
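
At the C level, that roughly corresponds to the following (the names here are
illustrative, not the exact uthread code):

uthread->flags |= DONT_MIGRATE;    /* the flag write */
cmb();                             /* compiler: keep the write before the read */
vcoreid = __vcoreid;               /* the vcoreid read from TLS */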

Do we need a wrmb() here?  Any remote code might see the write after the
vcoreid read, but the read is a bit different than normal ones.  We aren't
reading global memory, and we aren't trying to synchronize with another core.
All that matters is that if the thread running this code did the vcoreid read,
then whoever reads the flag sees the write.

The 'whoever' is not concurrently running code - it is 2LS code that either
runs on the vcore due to an IPI/notification, or it is 2LS code running
remotely that responded to a preemption of that vcore.  Both of these cases
require an IPI.  AFAIK, interrupts serialize memory operations, so whatever
writes were issued before the interrupt will have hit memory (or cache) before
we even do things like write out the trapframe of the thread.  If this isn't
true, then the synchronization we do when writing out the trapframe (before
allowing a remote core to try and recover the preempted uthread) will handle
the DONT_MIGRATE write.

Anyway, the point is that remote code will look at it, but only when told to
look.  That "telling" is the write, which happens after the
synchronizing/serializing events of the IPI path.

4. Memory Barriers and Locking
====================
The implementation of locks requires memory barriers (both compiler and CPU).
Regular users of locks do not need to worry about this.  Lock implementers do.

We need to consider reorderings of reads and writes from before and after the
lock/unlock write.  In these next sections, the reads and writes I talk about
are from a given thread/CPU.  Protected reads/writes are those that happen
while the lock is held.  When I say you need a wmb(), you could get by with a
cmb() and an atomic-RMW op: just so long as you have the cmb() and the
appropriate CPU mb() at a minimum.

4.1: Locking
---------------------
- Don't care about our reads or writes before the lock happening after the
  lock write.
- Don't want protected writes slipping out before the lock write, need a wmb()
  after the lock write.
- Don't want protected reads slipping out before the lock write, need a wrmb()
  after the lock write.

4.2: Unlocking
---------------------
- Don't want protected writes slipping out after the unlock, so we need a
  wmb() before the unlock write.
- Don't want protected reads slipping out after the unlock, so we need a
  rwmb() before the unlock write.  Arguably, there is some causality that
  makes this less meaningful (what did you do with the info?  If it wasn't a
  protected write, then who cares?).
- Don't care about reads or writes after the unlock happening earlier than the
  unlock write.
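
Putting 4.1 and 4.2 together, here is a minimal sketch of a test-and-set
spinlock.  'struct sketch_lock' and test_and_set() are made up for
illustration; on x86 the RMW's LOCK covers the CPU barriers on the lock side,
as discussed in section 2.2:

struct sketch_lock {
	int locked;
};

void sketch_spin_lock(struct sketch_lock *lock)
{
	while (test_and_set(&lock->locked))    /* atomic RMW: acts as a mb() */
		;
	cmb();    /* the RMW doesn't clobber "memory"; without the RMW's CPU
	           * barrier, 4.1 says we'd also need a wmb() and a wrmb() */
}

void sketch_spin_unlock(struct sketch_lock *lock)
{
	wmb();     /* keep protected writes from slipping past the unlock */
	rwmb();    /* keep protected reads from slipping past the unlock */
	lock->locked = 0;    /* the unlock write (both barriers include a cmb()) */
}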

5. Other Stuff
====================
Linux has a lot of work on memory barriers, far more advanced than our stuff.
Some of it doesn't make any sense.  I've asked our architects about things
like read_barrier_depends() and a few other things.  They also support some
non-Intel x86 clones that need a wmb_f() in place of a wmb() (they support
out-of-order writes).  If this pops up, we'll deal with it.

I chatted with Andrew a bit, and it turns out the following needs a barrier
on P2 under the Alpha's memory model:

(global) int x = 0, *p = 0;

P1:
x = 3;
FENCE
p = &x;

P2:
while (p == NULL) ;
assert(*p == 3);

As far as we can figure, you'd need some sort of 'value-speculating' hardware
to make this an issue in practice.  For now, we'll mark these spots in the
code if we see them, but I'm not overly concerned about it.

Also note that none of these barriers deal with things like page table walks,
page/segmentation update writes, non-temporal hints on writes, etc.  glhf with
that, future self!