| ucq.txt | 
 | Barret Rhoden | 
 |  | 
 | 1. Overview | 
 | ==================== | 
 | 1.1 What are they? | 
 | ------------------------------------------------------------------ | 
 | UCQs are a tool to send messages, with payloads, from the kernel to a process | 
 | through shared memory.  The depth of the queue is unlimited, barring running | 
out of memory.  They should be used in closed-loop or low-rate scenarios that
require a payload (or where building a system that can handle missing a message
is unwieldy).
 |  | 
 | The old BCQs were designed for a shared memory segment, such as procdata. | 
 | Since the kernel is sending the messages, we can use a more customized | 
 | solution.  The problems in this area are running out of memory, having the | 
 | kernel use user-provided-pointers, and synchronizing within the kernel (like a | 
 | spinlock for a TAILQ).  | 
 |  | 
 | The basic plan for these "unbounded concurrent queues" (UCQ) is to have linked | 
 | mmap'd pages of arrays of event messages (the BCQ is a circular array of event | 
 | messages, roughly).  As the kernel over-produces, it mmaps more pages and links | 
 | them together (via a header at the beginning of the page).  Userspace munmaps | 
 | when it is done with a page.  To avoid excessive mmap/munmap, we double-buffer | 
 | with two pages, one current and one spare.  We only need to mmap new ones when | 
 | the kernel gets far ahead of the user. | 
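A page layout along these lines makes the scheme concrete.  This is a minimal
sketch with assumed names and sizes (the real structs differ); the point is the
per-page header that links pages, plus a fixed array of message containers:

```c
#include <stdint.h>
#include <stddef.h>

#define PGSIZE 4096

/* Hypothetical message container: the payload plus a 'ready' flag the
 * producer flips once the payload is fully written. */
struct msg_container {
	uint8_t payload[48];	/* stand-in for an event message */
	uint32_t ready;
};

/* Header at the start of each mmap'd page.  next_page is 0 until the
 * kernel links a successor; nr_consumed is the inverted refcount used
 * when retiring the page. */
struct ucq_page_header {
	uintptr_t next_page;
	uint32_t nr_consumed;
};

#define NR_MSG_PER_PAGE \
	((PGSIZE - sizeof(struct ucq_page_header)) / sizeof(struct msg_container))

struct ucq_page {
	struct ucq_page_header header;
	struct msg_container msgs[NR_MSG_PER_PAGE];
};
```

With these assumed sizes, a page holds a few dozen messages; the real count
(~170 per the locking discussion below) depends on the actual struct sizes.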
 |  | 
 | - When we run out of room, the kernel will implicitly mmap another page, | 
 |   solving the memory allocation issue.  The worst thing userspace can do is | 
 | leak its memory, or cause an OOM condition, which we have to deal with anyway. | 
 |  | 
 | - Using user-pointers isn't conceptually a problem any more, so long as the | 
 |   kernel reads it in and verifies the address is in userspace (and that it can | 
 | handle page faults).  This wasn't the case a couple years ago when I made the | 
BCQs.  We are still careful with pointers - we only use them when messing with
which page is current, etc., and not when atomically adding items.
 |  | 
 | - Swapping pages/buffers requires locking, but we can't put the lock in the UCQ | 
 |   structure, since userspace could muck with it.  Instead, we lock at the | 
 | process level.  And instead of grabbing the process lock, we'll grab a hash | 
 | lock (hash table of locks, hashed on the pointer of the UCQ).  This will happen | 
every ~170 events or so.  Synchronization for normal ops (not buffer swaps) is
done with atomics.
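A hash lock can be sketched as a fixed table of spinlocks indexed by a hash of
the UCQ's address.  The names, table size, and hash here are assumptions, not
the kernel's actual implementation:

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stddef.h>

#define NR_HASH_LOCKS 32	/* assumed table size */

/* One lock per bucket; 0 = unlocked.  Static storage zero-initializes. */
static atomic_int hash_locks[NR_HASH_LOCKS];

static size_t ucq_hash(void *ucq)
{
	/* Hash on the UCQ's address; shift off the low (alignment) bits. */
	return ((uintptr_t)ucq >> 6) % NR_HASH_LOCKS;
}

static void hash_lock(void *ucq)
{
	while (atomic_exchange(&hash_locks[ucq_hash(ucq)], 1))
		;	/* spin until the old value was 0 (unlocked) */
}

static void hash_unlock(void *ucq)
{
	atomic_store(&hash_locks[ucq_hash(ucq)], 0);
}
```

Different UCQs usually hash to different buckets, so buffer swaps on unrelated
queues rarely contend.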
 |  | 
 | Another option instead of allocating more memory would be to have the kernel | 
 | block kthreads until the queue empties.  I really dislike that for a couple | 
 | reasons.  It's easier to handle running out of memory than spawning too many | 
 | kthreads, esp in critical parts of the code (where we can't sleep on memory | 
 | allocation).  More importantly, it's a real pain to block on a user's actions. | 
 | For instance, I'm not about to put a semaphore in user-writable memory. | 
 |  | 
 | 1.2 Why do we have them? | 
 | ------------------------------------------------------------------ | 
 | Enqueuing messages in BCQs could fail, due to either a misbehaving process, an | 
 | unlucky circumstance, or most commonly because the queue was full.  Event | 
 | queues, which used BCQs as a building block, would handle the failure as an | 
 | 'overflow' of events, and signal the process with a bit.  This means the | 
 | program needs to know how to handle overflow, which becomes painful. | 
 |  | 
 | A specific case was syscall completion events.  When a syscall was done, the | 
 | kernel would send a message that a particular syscall was done.  Userspace | 
 | needed to know exactly which one was done, under normal circumstances.  With | 
 | BCQs, the 2LS needed to know how to handle overflow, which means it needs to | 
 | track every outstanding syscall so that it can poll to see which call | 
 | completed.  To do this in a scalable manner required each vcore to have its own | 
TAILQ of outstanding uthreads (blocked on syscalls).  The problem with this is
 | that blocked syscalls are now tied to vcores, and that vcore might not want to | 
 | yield even if it has no work to do (since it needs to receive its message). | 
 |  | 
 | Ultimately, the issue is that complicated systems could benefit from not | 
 | needing to handle overflow.  For certain event queue usages, the unbounded | 
nature of UCQs makes the system far simpler.  When we build other systems in
 | userspace, such as using ev_qs for syscalls and uthread blocking, then we can | 
 | leverage the UCQs. | 
 |  | 
Note the main downfall of UCQs is that a process can die if the kernel runs
out of memory while trying to send it messages.  If the kernel sends messages
 | faster than the process can handle them (for whatever reason, including being | 
 | swapped out), eventually we'll run out of memory.  UCQs need to be used | 
 | carefully, such that the process isn't requesting an *unbounded* amount of | 
 | messages at one time. | 
 |  | 
 | The main benefit of UCQs in this situation is that they can handle spikes of | 
 | messages and they remove the need to handle overflow.  Using BCQs, we'd need to | 
 | handle overflow even if it was unlikely, and the impact of this extended beyond | 
 | actually handling overflow.  Check out the old overflow handling code (and | 
 | maybe some notes in the Documentation) for details about how we needed to do | 
 | work for every syscall, in the off chance we had to handle overflow. | 
 |  | 
 | 2. How do they work? | 
 | ==================== | 
 | 2.1 Producer (kernel) | 
 | ------------------------------------------------------------------ | 
Producers atomically fight for slots in prod_idx, which encodes both the page
 | and the msg number within the page.  If the slot is good, we just write our | 
 | message.  If it isn't, things get interesting.  We need to synchronize, so we | 
 | lock externally to the ucq.  This is where the hash lock comes in.  A bad slot | 
is one in which there is no page or the message slot is past the end of the
array of msgs in the page.
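Since the pages are page-aligned, the page address and slot number pack into a
single word: the low 12 bits are the slot, the rest name the page.  A sketch,
with the per-page message count assumed:

```c
#include <stdint.h>
#include <stdbool.h>

#define PGSIZE 4096
#define NR_MSG_PER_PAGE 78	/* assumed; depends on the page layout */

/* prod_idx packs the current page's (page-aligned) address in the high
 * bits and the next free slot number in the low 12 bits, so a single
 * fetch-and-add claims a slot and names its page atomically. */
static uintptr_t slot_pg(uintptr_t slot)
{
	return slot & ~(uintptr_t)(PGSIZE - 1);
}

static uintptr_t slot_idx(uintptr_t slot)
{
	return slot & (uintptr_t)(PGSIZE - 1);
}

/* A bad slot: no page at all, or the msg index ran off the array
 * because producers over-incremented before the fix-up. */
static bool slot_is_good(uintptr_t slot)
{
	return slot_pg(slot) && slot_idx(slot) < NR_MSG_PER_PAGE;
}
```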
 |  | 
 | Whoever grabs the lock first will be responsible for getting a new page and | 
 | resetting the counter (which tracks which slot is next).  All future lockers | 
 | can tell that the work is done by examining the counter and trying to get a | 
 | slot.  If they got a good one, everything is okay and it is just like they got | 
 | a good slot in the first place (with the exception of a lock/unlock pair). | 
 |  | 
 | One minor problem is that with sufficient producers and a delayed fixer (who | 
 | holds the lock), we could overflow the prod_idx (around 3900 would overflow | 
 | into a new page on the 0th slot).  To avoid this, the first producer to detect | 
 | the slot/index is no good will set an overflow flag.  All future producers will | 
 | check this flag before attempting to get a slot, and if we're overflowing, they | 
 | will jump to the "grab lock" phase.  This limits the window of vulnerability. | 
 | In theory, you could have 3900 producers at exactly the same time all fetch and | 
 | add, before any of the others have a chance to set the overflow.  Not only does | 
 | this require 3900 cores, but they all have to be writing to the exact same UCQ. | 
 | If this happens, we have a warning that will go off and I'll fix it. | 
 |  | 
 | So the last part to deal with is getting a new page and linking it with the old | 
 | one.  We simply atomic_swap on the spare_pg.  If there is a spare already | 
 | mmaped, we use it.  If there isn't, we need to mmap a new page.  Either way, we | 
 | tell the old page to follow to the new page.  We set the index to a good value, | 
 | then unlock (and clear the overflow flag). | 
 |  | 
 | When we set the counter, we set it to 1, instead of 0, thereby reserving slot 0 | 
 | for ourselves.  This prevents a DoS from the program.  The user could muck with | 
 | the prod_idx endlessly, in a way that seems benign.  To prevent this, we make | 
 | sure that any time the process grabs the hashlock that it gets a good slot. | 
 | Whenever we fail to get a slot we lock, and whenever we lock we get a good | 
 | slot. | 
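The fixer's counter reset can be sketched as follows.  The name is an
assumption, and the hash lock and page swap around it are elided; the point is
publishing the counter at 1 while keeping slot 0:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Holding the hash lock, the fixer publishes the new page with its
 * counter already at 1, reserving slot 0 for itself.  Every locker thus
 * leaves with a good slot no matter what userspace did to prod_idx. */
static uintptr_t reset_prod_idx(atomic_uintptr_t *prod_idx, uintptr_t new_pg)
{
	atomic_store(prod_idx, new_pg + 1);	/* next producer gets slot 1 */
	return new_pg;				/* our reserved slot 0 */
}
```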
 |  | 
 | Side note: another related DoS involves using CAS.  You can't CAS with | 
 | userspace unless you're willing to fail.  O/W, you could be DoS'd. | 
 |  | 
 | When we have a slot, we write in the ev_msg, and then toggle the 'ready' flag | 
 | in the message container.  This is in case the consumer is trying to | 
 | concurrently read our message while we are writing. | 
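The handoff can be sketched with a release store on the flag and an acquire
spin on the consumer side.  (The real code toggles the flag so containers can
be reused without clearing; this sketch just sets and tests it, and the names
are assumptions.)

```c
#include <stdatomic.h>
#include <string.h>

struct msg_container {
	char payload[48];	/* stand-in for an ev_msg */
	atomic_uint ready;
};

/* Producer: write the payload, then set 'ready' with release ordering,
 * so a consumer that sees the flag also sees the whole message. */
static void post_msg(struct msg_container *c, const char *msg, size_t len)
{
	memcpy(c->payload, msg, len);
	atomic_store_explicit(&c->ready, 1, memory_order_release);
}

/* Consumer: spin until the producer is done, then copy the message out. */
static void copy_out_msg(struct msg_container *c, char *out, size_t len)
{
	while (!atomic_load_explicit(&c->ready, memory_order_acquire))
		;	/* producer still writing */
	memcpy(out, c->payload, len);
}
```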
 |  | 
 | And finally, every time the kernel reads or writes to something, we need to | 
 | verify the address (which is user-supplied).  Eventually, the kernel can handle | 
 | a page fault on a user address, but we can't tolerate KVAs in UCQs.  Note that | 
 | old versions allowed UCQs/BCQs to be at KVAs (procdata), but that behavior will | 
 | probably go away soon (error prone, we now handle user pointers anyway, and we | 
 | usually load 'current' when dealing with a proc). | 
 |  | 
 |  | 
 | 2.2 Consumer | 
 | ------------------------------------------------------------------ | 
 | Consumers CAS (compare-and-swap) on claiming a slot.  We need to use CAS, since | 
 | we don't want to advance unless the queue is not empty.  If they detect the | 
 | slot is bad, they fight among themselves for a lock, which is embedded in the | 
 | UCQ, and the winner sets the counter to track the next page, frees extra pages, | 
 | etc. | 
 |  | 
Emptiness is defined as prod_idx == cons_idx.  The producer's and the
consumer's *next* slots are the same, so there are no items to be consumed.  If
 | the prod_idx is bad, the consumer needs to wait til there is a good next page | 
 | (the kernel is in the process of finding and filling a good slot).  If the | 
 | cons_idx is bad, it means we need to go to the next page (lock first). | 
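The claim path can be sketched as a CAS that advances cons_idx only when the
queue is non-empty and the slot is good.  Helper names and the slot encoding
are assumptions (the same packed page+slot word as on the producer side):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define PGSIZE 4096
#define NR_MSG_PER_PAGE 78	/* assumed */

static bool slot_is_good(uintptr_t slot)
{
	return (slot & ~(uintptr_t)(PGSIZE - 1)) &&
	       (slot & (uintptr_t)(PGSIZE - 1)) < NR_MSG_PER_PAGE;
}

/* Try to claim the next consumer slot.  Returns false if the queue is
 * empty, the slot is bad (caller takes the lock path), or we lost the
 * CAS race (caller just retries). */
static bool claim_slot(atomic_uintptr_t *cons_idx,
		       atomic_uintptr_t *prod_idx, uintptr_t *my_slot)
{
	uintptr_t idx = atomic_load(cons_idx);

	if (idx == atomic_load(prod_idx))
		return false;	/* empty: next prod and cons slots match */
	if (!slot_is_good(idx))
		return false;	/* next page needed; grab the lock */
	if (!atomic_compare_exchange_strong(cons_idx, &idx, idx + 1))
		return false;	/* another consumer beat us; retry */
	*my_slot = idx;
	return true;
}
```

The CAS (rather than fetch-and-add) is what keeps consumers from advancing
past an empty queue.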
 |  | 
 | When locking, the consumer does the same trick as the producer: try to get a | 
 | slot again, in case someone else got the lock before you and fixed things up. | 
 |  | 
 | So once we have the next page (it has been posted by the kernel), we can set up | 
 | the cons_idx so consumers can grab slots.  We don't need to reserve a slot for | 
 | ourselves (no DoS risk). | 
 |  | 
 | After setting up the counter, our next job is to *free* the old page of | 
 | consumed items.  The trick here is that there may be consumers still using it. | 
 | We statically know how many items there are on the page, so we have all | 
 | consumers increment another counter when they are done with the slot.  When | 
 | that counter is the max, we know everyone is done with the page and it can be | 
 | *freed*.  That counter is like an inverted reference count.  I don't care when | 
 | it is 0, I care when it is the max.  Alternatively, we could have initialized | 
 | it to MAX and decremented, but this way felt more natural.  When the page is | 
 | done, we don't actually free it.  We'll atomic_swap it with the spare_pg, | 
 | making it the spare.  If there already was a spare, then we have too many pages | 
 | and need to munmap the old spare. | 
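The inverted reference count amounts to a fetch-and-add where the interesting
value is the max, not zero.  A sketch, with the per-page capacity passed in:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Each consumer bumps nr_consumed after copying its message out.  The
 * one who pushes it to the page's capacity learns that everyone is done
 * and the page can be retired (made the spare, or munmapped). */
static bool put_slot(atomic_uint *nr_consumed, unsigned slots_per_page)
{
	return atomic_fetch_add(nr_consumed, 1) + 1 == slots_per_page;
}
```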
 |  | 
 | So we finally get our good slot, and we spin til the kernel has loaded it with | 
 | a message.  Then we just copy it out, increment the 'number consumed' counter, | 
 | and we're done. | 
 |  | 
 | 3. Misc Other Notes: | 
 | ==================== | 
 | I think most of these things were discussed inline.  There are a few random | 
 | things worth mentioning, some of which are obvious in hindsight: | 
 | - The kernel can't CAS and also not fail, o/w it can be DoSed. | 
 |  | 
 | - Without CAS, the kernel can't atomically change the page and the index unless | 
 |   they are the same long / modifiable in the same atomic statement.  This is | 
 | why the page and the counter/slot number are put together in prod_idx. | 
 |  | 
 | - Kernel writers/producers need to stop/delay while another producer fixes | 
 |   things up.  The only acceptable place to delay/spin is on the proc's | 
 | hashlock.  Any other scheme will probably fail, even if all you want to do is | 
swap the spare page and reset the counter.  It doesn't work for the same reason
we have the current page and the index in one memory location.
 |  | 
 | - The prod_idx and cons_idx are both protected by the lock and atomically | 
 |   incremented.  The lock protects who resets them, with the understanding that | 
 | it should only be done when everyone agrees the counter is bad (too large), but | 
 | the normal atomics happen lock-free. | 
 |  | 
 | - Userspace should mmap a huge chunk of memory and then pass in page addresses | 
 |   to the ucq_init() function.  I made the init function so that this would be | 
 | really easy. |