Documentation/ucq.txt - akaros - Git at Google

 ucq.txt
 Barret Rhoden

 1. Overview
 ====================
 1.1 What are they?
 ------------------------------------------------------------------
 UCQs are a tool to send messages, with payloads, from the kernel to a process
 through shared memory.  The depth of the queue is unlimited, barring running
 out of memory.  They should be used in closed loop or low rate scenarios that
 require a payload (or that building a system that can handle missing a message
 is unwieldy).

 The old BCQs were designed for a shared memory segment, such as procdata.
 Since the kernel is sending the messages, we can use a more customized
 solution.  The problems in this area are running out of memory, having the
 kernel use user-provided-pointers, and synchronizing within the kernel (like a
 spinlock for a TAILQ).

 The basic plan for these "unbounded concurrent queues" (UCQ) is to have linked
 mmap'd pages of arrays of event messages (the BCQ is a circular array of event
 messages, roughly).  As the kernel over-produces, it mmaps more pages and links
 them together (via a header at the beginning of the page).  Userspace munmaps
 when it is done with a page.  To avoid excessive mmap/munmap, we double-buffer
 with two pages, one current and one spare.  We only need to mmap new ones when
 the kernel gets far ahead of the user.

 - When we run out of room, the kernel will implicitly mmap another page,
   solving the memory allocation issue.  The worst thing userspace can do is
 leak its memory, or cause an OOM condition, which we have to deal with anyway.

 - Using user-pointers isn't conceptually a problem any more, so long as the
   kernel reads it in and verifies the address is in userspace (and that it can
 handle page faults).  This wasn't the case a couple years ago when I made the
 BCQs.  We still are careful about pointers - we only use them when messing with
 which page is current, etc, and not when atomically adding items.

 - Swapping pages/buffers requires locking, but we can't put the lock in the UCQ
   structure, since userspace could muck with it.  Instead, we lock at the
 process level.  And instead of grabbing the process lock, we'll grab a hash
 lock (hash table of locks, hashed on the pointer of the UCQ).  This will happen
 every ~170 events or so.  Synchronization for normal ops (not buffer swaps) are
 done with atomics.

 Another option instead of allocating more memory would be to have the kernel
 block kthreads until the queue empties.  I really dislike that for a couple
 reasons.  It's easier to handle running out of memory than spawning too many
 kthreads, esp in critical parts of the code (where we can't sleep on memory
 allocation).  More importantly, it's a real pain to block on a user's actions.
 For instance, I'm not about to put a semaphore in user-writable memory.

 1.2 Why do we have them?
 ------------------------------------------------------------------
 Enqueuing messages in BCQs could fail, due to either a misbehaving process, an
 unlucky circumstance, or most commonly because the queue was full.  Event
 queues, which used BCQs as a building block, would handle the failure as an
 'overflow' of events, and signal the process with a bit.  This means the
 program needs to know how to handle overflow, which becomes painful.

 A specific case was syscall completion events.  When a syscall was done, the
 kernel would send a message that a particular syscall was done.  Userspace
 needed to know exactly which one was done, under normal circumstances.  With
 BCQs, the 2LS needed to know how to handle overflow, which means it needs to
 track every outstanding syscall so that it can poll to see which call
 completed.  To do this in a scalable manner required each vcore to have its own
 TAILQ of oustanding uthreads (blocked on syscalls).  The problem with this is
 that blocked syscalls are now tied to vcores, and that vcore might not want to
 yield even if it has no work to do (since it needs to receive its message).

 Ultimately, the issue is that complicated systems could benefit from not
 needing to handle overflow.  For certain event queue usages, the unbounded
 nature of UCQs make the system far simpler.  When we build other systems in
 userspace, such as using ev_qs for syscalls and uthread blocking, then we can
 leverage the UCQs.

 Note the main downfall of UCQs is that a process can die if it the kernel runs
 out of mememory while trying to send it messages.  If the kernel sends messages
 faster than the process can handle them (for whatever reason, including being
 swapped out), eventually we'll run out of memory.  UCQs need to be used
 carefully, such that the process isn't requesting an *unbounded* amount of
 messages at one time.

 The main benefit of UCQs in this situation is that they can handle spikes of
 messages and they remove the need to handle overflow.  Using BCQs, we'd need to
 handle overflow even if it was unlikely, and the impact of this extended beyond
 actually handling overflow.  Check out the old overflow handling code (and
 maybe some notes in the Documentation) for details about how we needed to do
 work for every syscall, in the off chance we had to handle overflow.

 2. How do they work?
 ====================
 2.1 Producer (kernel)
 ------------------------------------------------------------------
 Producers atomically fight for slots in prod_idx, which encode both the page
 and the msg number within the page.  If the slot is good, we just write our
 message.  If it isn't, things get interesting.  We need to synchronize, so we
 lock externally to the ucq.  This is where the hash lock comes in.  A bad slot
 is one in which there is no page or the message slot is greater than the array
 of msgs in the page.

 Whoever grabs the lock first will be responsible for getting a new page and
 resetting the counter (which tracks which slot is next).  All future lockers
 can tell that the work is done by examining the counter and trying to get a
 slot.  If they got a good one, everything is okay and it is just like they got
 a good slot in the first place (with the exception of a lock/unlock pair).

 One minor problem is that with sufficient producers and a delayed fixer (who
 holds the lock), we could overflow the prod_idx (around 3900 would overflow
 into a new page on the 0th slot).  To avoid this, the first producer to detect
 the slot/index is no good will set an overflow flag.  All future producers will
 check this flag before attempting to get a slot, and if we're overflowing, they
 will jump to the "grab lock" phase.  This limits the window of vulnerability.
 In theory, you could have 3900 producers at exactly the same time all fetch and
 add, before any of the others have a chance to set the overflow.  Not only does
 this require 3900 cores, but they all have to be writing to the exact same UCQ.
 If this happens, we have a warning that will go off and I'll fix it.

 So the last part to deal with is getting a new page and linking it with the old
 one.  We simply atomic_swap on the spare_pg.  If there is a spare already
 mmaped, we use it.  If there isn't, we need to mmap a new page.  Either way, we
 tell the old page to follow to the new page.  We set the index to a good value,
 then unlock (and clear the overflow flag).

 When we set the counter, we set it to 1, instead of 0, thereby reserving slot 0
 for ourselves.  This prevents a DoS from the program.  The user could muck with
 the prod_idx endlessly, in a way that seems benign.  To prevent this, we make
 sure that any time the process grabs the hashlock that it gets a good slot.
 Whenever we fail to get a slot we lock, and whenever we lock we get a good
 slot.

 Side note: another related DoS involves using CAS.  You can't CAS with
 userspace unless you're willing to fail.  O/W, you could be DoS'd.

 When we have a slot, we write in the ev_msg, and then toggle the 'ready' flag
 in the message container.  This is in case the consumer is trying to
 concurrently read our message while we are writing.

 And finally, every time the kernel reads or writes to something, we need to
 verify the address (which is user-supplied).  Eventually, the kernel can handle
 a page fault on a user address, but we can't tolerate KVAs in UCQs.  Note that
 old versions allowed UCQs/BCQs to be at KVAs (procdata), but that behavior will
 probably go away soon (error prone, we now handle user pointers anyway, and we
 usually load 'current' when dealing with a proc).


 2.2 Consumer
 ------------------------------------------------------------------
 Consumers CAS (compare-and-swap) on claiming a slot.  We need to use CAS, since
 we don't want to advance unless the queue is not empty.  If they detect the
 slot is bad, they fight among themselves for a lock, which is embedded in the
 UCQ, and the winner sets the counter to track the next page, frees extra pages,
 etc.

 Emptiness is defined as the prod_idx == cons_idx.  Both the producer's and the
 consumer's *next* slot is the same, so there is no items to be consumed.  If
 the prod_idx is bad, the consumer needs to wait til there is a good next page
 (the kernel is in the process of finding and filling a good slot).  If the
 cons_idx is bad, it means we need to go to the next page (lock first).

 When locking, the consumer does the same trick as the producer: try to get a
 slot again, in case someone else got the lock before you and fixed things up.

 So once we have the next page (it has been posted by the kernel), we can set up
 the cons_idx so consumers can grab slots.  We don't need to reserve a slot for
 ourselves (no DoS risk).

 After setting up the counter, our next job is to *free* the old page of
 consumed items.  The trick here is that there may be consumers still using it.
 We statically know how many items there are on the page, so we have all
 consumers increment another counter when they are done with the slot.  When
 that counter is the max, we know everyone is done with the page and it can be
 *freed*.  That counter is like an inverted reference count.  I don't care when
 it is 0, I care when it is the max.  Alternatively, we could have initialized
 it to MAX and decremented, but this way felt more natural.  When the page is
 done, we don't actually free it.  We'll atomic_swap it with the spare_pg,
 making it the spare.  If there already was a spare, then we have too many pages
 and need to munmap the old spare.

 So we finally get our good slot, and we spin til the kernel has loaded it with
 a message.  Then we just copy it out, increment the 'number consumed' counter,
 and we're done.

 3. Misc Other Notes:
 ====================
 I think most of these things were discussed inline.  There are a few random
 things worth mentioning, some of which are obvious in hindsight:
 - The kernel can't CAS and also not fail, o/w it can be DoSed.

 - Without CAS, the kernel can't atomically change the page and the index unless
   they are the same long / modifiable in the same atomic statement.  This is
 why the page and the counter/slot number are put together in prod_idx.

 - Kernel writers/producers need to stop/delay while another producer fixes
   things up.  The only acceptable place to delay/spin is on the proc's
 hashlock.  Any other scheme will probably fail, even if all you want to do is
 swap the spare page and reset the counter.It doesn't work for the same reason
 we have the current page and the index in one memory location.

 - The prod_idx and cons_idx are both protected by the lock and atomically
   incremented.  The lock protects who resets them, with the understanding that
 it should only be done when everyone agrees the counter is bad (too large), but
 the normal atomics happen lock-free.

 - Userspace should mmap a huge chunk of memory and then pass in page addresses
   to the ucq_init() function.  I made the init function so that this would be
 really easy.
	ucq.txt
	Barret Rhoden

	1. Overview
	====================
	1.1 What are they?
	------------------------------------------------------------------
	UCQs are a tool to send messages, with payloads, from the kernel to a process
	through shared memory. The depth of the queue is unlimited, barring running
	out of memory. They should be used in closed loop or low rate scenarios that
	require a payload (or that building a system that can handle missing a message
	is unwieldy).

	The old BCQs were designed for a shared memory segment, such as procdata.
	Since the kernel is sending the messages, we can use a more customized
	solution. The problems in this area are running out of memory, having the
	kernel use user-provided-pointers, and synchronizing within the kernel (like a
	spinlock for a TAILQ).

	The basic plan for these "unbounded concurrent queues" (UCQ) is to have linked
	mmap'd pages of arrays of event messages (the BCQ is a circular array of event
	messages, roughly). As the kernel over-produces, it mmaps more pages and links
	them together (via a header at the beginning of the page). Userspace munmaps
	when it is done with a page. To avoid excessive mmap/munmap, we double-buffer
	with two pages, one current and one spare. We only need to mmap new ones when
	the kernel gets far ahead of the user.

	- When we run out of room, the kernel will implicitly mmap another page,
	solving the memory allocation issue. The worst thing userspace can do is
	leak its memory, or cause an OOM condition, which we have to deal with anyway.

	- Using user-pointers isn't conceptually a problem any more, so long as the
	kernel reads it in and verifies the address is in userspace (and that it can
	handle page faults). This wasn't the case a couple years ago when I made the
	BCQs. We still are careful about pointers - we only use them when messing with
	which page is current, etc, and not when atomically adding items.

	- Swapping pages/buffers requires locking, but we can't put the lock in the UCQ
	structure, since userspace could muck with it. Instead, we lock at the
	process level. And instead of grabbing the process lock, we'll grab a hash
	lock (hash table of locks, hashed on the pointer of the UCQ). This will happen
	every ~170 events or so. Synchronization for normal ops (not buffer swaps) are
	done with atomics.

	Another option instead of allocating more memory would be to have the kernel
	block kthreads until the queue empties. I really dislike that for a couple
	reasons. It's easier to handle running out of memory than spawning too many
	kthreads, esp in critical parts of the code (where we can't sleep on memory
	allocation). More importantly, it's a real pain to block on a user's actions.
	For instance, I'm not about to put a semaphore in user-writable memory.

	1.2 Why do we have them?
	------------------------------------------------------------------
	Enqueuing messages in BCQs could fail, due to either a misbehaving process, an
	unlucky circumstance, or most commonly because the queue was full. Event
	queues, which used BCQs as a building block, would handle the failure as an
	'overflow' of events, and signal the process with a bit. This means the
	program needs to know how to handle overflow, which becomes painful.

	A specific case was syscall completion events. When a syscall was done, the
	kernel would send a message that a particular syscall was done. Userspace
	needed to know exactly which one was done, under normal circumstances. With
	BCQs, the 2LS needed to know how to handle overflow, which means it needs to
	track every outstanding syscall so that it can poll to see which call
	completed. To do this in a scalable manner required each vcore to have its own
	TAILQ of oustanding uthreads (blocked on syscalls). The problem with this is
	that blocked syscalls are now tied to vcores, and that vcore might not want to
	yield even if it has no work to do (since it needs to receive its message).

	Ultimately, the issue is that complicated systems could benefit from not
	needing to handle overflow. For certain event queue usages, the unbounded
	nature of UCQs make the system far simpler. When we build other systems in
	userspace, such as using ev_qs for syscalls and uthread blocking, then we can
	leverage the UCQs.

	Note the main downfall of UCQs is that a process can die if it the kernel runs
	out of mememory while trying to send it messages. If the kernel sends messages
	faster than the process can handle them (for whatever reason, including being
	swapped out), eventually we'll run out of memory. UCQs need to be used
	carefully, such that the process isn't requesting an unbounded amount of
	messages at one time.

	The main benefit of UCQs in this situation is that they can handle spikes of
	messages and they remove the need to handle overflow. Using BCQs, we'd need to
	handle overflow even if it was unlikely, and the impact of this extended beyond
	actually handling overflow. Check out the old overflow handling code (and
	maybe some notes in the Documentation) for details about how we needed to do
	work for every syscall, in the off chance we had to handle overflow.

	2. How do they work?
	====================
	2.1 Producer (kernel)
	------------------------------------------------------------------
	Producers atomically fight for slots in prod_idx, which encode both the page
	and the msg number within the page. If the slot is good, we just write our
	message. If it isn't, things get interesting. We need to synchronize, so we
	lock externally to the ucq. This is where the hash lock comes in. A bad slot
	is one in which there is no page or the message slot is greater than the array
	of msgs in the page.

	Whoever grabs the lock first will be responsible for getting a new page and
	resetting the counter (which tracks which slot is next). All future lockers
	can tell that the work is done by examining the counter and trying to get a
	slot. If they got a good one, everything is okay and it is just like they got
	a good slot in the first place (with the exception of a lock/unlock pair).

	One minor problem is that with sufficient producers and a delayed fixer (who
	holds the lock), we could overflow the prod_idx (around 3900 would overflow
	into a new page on the 0th slot). To avoid this, the first producer to detect
	the slot/index is no good will set an overflow flag. All future producers will
	check this flag before attempting to get a slot, and if we're overflowing, they
	will jump to the "grab lock" phase. This limits the window of vulnerability.
	In theory, you could have 3900 producers at exactly the same time all fetch and
	add, before any of the others have a chance to set the overflow. Not only does
	this require 3900 cores, but they all have to be writing to the exact same UCQ.
	If this happens, we have a warning that will go off and I'll fix it.

	So the last part to deal with is getting a new page and linking it with the old
	one. We simply atomic_swap on the spare_pg. If there is a spare already
	mmaped, we use it. If there isn't, we need to mmap a new page. Either way, we
	tell the old page to follow to the new page. We set the index to a good value,
	then unlock (and clear the overflow flag).

	When we set the counter, we set it to 1, instead of 0, thereby reserving slot 0
	for ourselves. This prevents a DoS from the program. The user could muck with
	the prod_idx endlessly, in a way that seems benign. To prevent this, we make
	sure that any time the process grabs the hashlock that it gets a good slot.
	Whenever we fail to get a slot we lock, and whenever we lock we get a good
	slot.

	Side note: another related DoS involves using CAS. You can't CAS with
	userspace unless you're willing to fail. O/W, you could be DoS'd.

	When we have a slot, we write in the ev_msg, and then toggle the 'ready' flag
	in the message container. This is in case the consumer is trying to
	concurrently read our message while we are writing.

	And finally, every time the kernel reads or writes to something, we need to
	verify the address (which is user-supplied). Eventually, the kernel can handle
	a page fault on a user address, but we can't tolerate KVAs in UCQs. Note that
	old versions allowed UCQs/BCQs to be at KVAs (procdata), but that behavior will
	probably go away soon (error prone, we now handle user pointers anyway, and we
	usually load 'current' when dealing with a proc).


	2.2 Consumer
	------------------------------------------------------------------
	Consumers CAS (compare-and-swap) on claiming a slot. We need to use CAS, since
	we don't want to advance unless the queue is not empty. If they detect the
	slot is bad, they fight among themselves for a lock, which is embedded in the
	UCQ, and the winner sets the counter to track the next page, frees extra pages,
	etc.

	Emptiness is defined as the prod_idx == cons_idx. Both the producer's and the
	consumer's next slot is the same, so there is no items to be consumed. If
	the prod_idx is bad, the consumer needs to wait til there is a good next page
	(the kernel is in the process of finding and filling a good slot). If the
	cons_idx is bad, it means we need to go to the next page (lock first).

	When locking, the consumer does the same trick as the producer: try to get a
	slot again, in case someone else got the lock before you and fixed things up.

	So once we have the next page (it has been posted by the kernel), we can set up
	the cons_idx so consumers can grab slots. We don't need to reserve a slot for
	ourselves (no DoS risk).

	After setting up the counter, our next job is to free the old page of
	consumed items. The trick here is that there may be consumers still using it.
	We statically know how many items there are on the page, so we have all
	consumers increment another counter when they are done with the slot. When
	that counter is the max, we know everyone is done with the page and it can be
	freed. That counter is like an inverted reference count. I don't care when
	it is 0, I care when it is the max. Alternatively, we could have initialized
	it to MAX and decremented, but this way felt more natural. When the page is
	done, we don't actually free it. We'll atomic_swap it with the spare_pg,
	making it the spare. If there already was a spare, then we have too many pages
	and need to munmap the old spare.

	So we finally get our good slot, and we spin til the kernel has loaded it with
	a message. Then we just copy it out, increment the 'number consumed' counter,
	and we're done.

	3. Misc Other Notes:
	====================
	I think most of these things were discussed inline. There are a few random
	things worth mentioning, some of which are obvious in hindsight:
	- The kernel can't CAS and also not fail, o/w it can be DoSed.

	- Without CAS, the kernel can't atomically change the page and the index unless
	they are the same long / modifiable in the same atomic statement. This is
	why the page and the counter/slot number are put together in prod_idx.

	- Kernel writers/producers need to stop/delay while another producer fixes
	things up. The only acceptable place to delay/spin is on the proc's
	hashlock. Any other scheme will probably fail, even if all you want to do is
	swap the spare page and reset the counter.It doesn't work for the same reason
	we have the current page and the index in one memory location.

	- The prod_idx and cons_idx are both protected by the lock and atomically
	incremented. The lock protects who resets them, with the understanding that
	it should only be done when everyone agrees the counter is bad (too large), but
	the normal atomics happen lock-free.

	- Userspace should mmap a huge chunk of memory and then pass in page addresses
	to the ucq_init() function. I made the init function so that this would be
	really easy.