ucq.txt
Barret Rhoden

1. Overview
====================
1.1 What are they?
------------------------------------------------------------------
UCQs are a tool to send messages, with payloads, from the kernel to a process
through shared memory. The depth of the queue is unlimited, barring running
out of memory. They should be used in closed-loop or low-rate scenarios that
require a payload (or for which building a system that can handle missing a
message is unwieldy).

The old BCQs were designed for a shared memory segment, such as procdata.
Since the kernel is sending the messages, we can use a more customized
solution. The problems in this area are running out of memory, having the
kernel use user-provided pointers, and synchronizing within the kernel (like a
spinlock for a TAILQ).

The basic plan for these "unbounded concurrent queues" (UCQ) is to have linked
mmap'd pages of arrays of event messages (the BCQ is a circular array of event
messages, roughly). As the kernel over-produces, it mmaps more pages and links
them together (via a header at the beginning of the page). Userspace munmaps
when it is done with a page. To avoid excessive mmap/munmap, we double-buffer
with two pages, one current and one spare. We only need to mmap new ones when
the kernel gets far ahead of the user.

- When we run out of room, the kernel will implicitly mmap another page,
solving the memory allocation issue. The worst thing userspace can do is
leak its memory, or cause an OOM condition, which we have to deal with anyway.

- Using user pointers isn't conceptually a problem any more, so long as the
kernel reads them in and verifies the addresses are in userspace (and that it
can handle page faults). This wasn't the case a couple years ago when I made
the BCQs. We are still careful about pointers - we only use them when messing
with which page is current, etc, and not when atomically adding items.

- Swapping pages/buffers requires locking, but we can't put the lock in the UCQ
structure, since userspace could muck with it. Instead, we lock at the
process level. And instead of grabbing the process lock, we'll grab a hash
lock (a hash table of locks, hashed on the pointer of the UCQ). This will
happen every ~170 events or so. Synchronization for normal ops (not buffer
swaps) is done with atomics.

Another option instead of allocating more memory would be to have the kernel
block kthreads until the queue empties. I really dislike that for a couple
reasons. It's easier to handle running out of memory than spawning too many
kthreads, esp in critical parts of the code (where we can't sleep on memory
allocation). More importantly, it's a real pain to block on a user's actions.
For instance, I'm not about to put a semaphore in user-writable memory.
1.2 Why do we have them?
------------------------------------------------------------------
Enqueuing messages in BCQs could fail, due to either a misbehaving process, an
unlucky circumstance, or most commonly because the queue was full. Event
queues, which used BCQs as a building block, would handle the failure as an
'overflow' of events, and signal the process with a bit. This means the
program needs to know how to handle overflow, which becomes painful.

A specific case was syscall completion events. When a syscall was done, the
kernel would send a message that a particular syscall was done. Userspace
needed to know exactly which one was done, under normal circumstances. With
BCQs, the 2LS needed to know how to handle overflow, which means it needs to
track every outstanding syscall so that it can poll to see which call
completed. To do this in a scalable manner required each vcore to have its own
TAILQ of outstanding uthreads (blocked on syscalls). The problem with this is
that blocked syscalls are now tied to vcores, and that vcore might not want to
yield even if it has no work to do (since it needs to receive its message).

Ultimately, the issue is that complicated systems could benefit from not
needing to handle overflow. For certain event queue usages, the unbounded
nature of UCQs makes the system far simpler. When we build other systems in
userspace, such as using ev_qs for syscalls and uthread blocking, then we can
leverage the UCQs.

Note the main downfall of UCQs is that a process can die if the kernel runs
out of memory while trying to send it messages. If the kernel sends messages
faster than the process can handle them (for whatever reason, including being
swapped out), eventually we'll run out of memory. UCQs need to be used
carefully, such that the process isn't requesting an *unbounded* amount of
messages at one time.

The main benefit of UCQs in this situation is that they can handle spikes of
messages and they remove the need to handle overflow. Using BCQs, we'd need to
handle overflow even if it was unlikely, and the impact of this extended beyond
actually handling overflow. Check out the old overflow handling code (and
maybe some notes in the Documentation) for details about how we needed to do
work for every syscall, in the off chance we had to handle overflow.

2. How do they work?
====================
2.1 Producer (kernel)
------------------------------------------------------------------
Producers atomically fight for slots in prod_idx, which encodes both the page
and the msg number within the page. If the slot is good, we just write our
message. If it isn't, things get interesting. We need to synchronize, so we
lock externally to the ucq. This is where the hash lock comes in. A bad slot
is one for which there is no page, or for which the msg number is beyond the
array of msgs in the page.

Whoever grabs the lock first will be responsible for getting a new page and
resetting the counter (which tracks which slot is next). All future lockers
can tell that the work is done by examining the counter and trying to get a
slot. If they got a good one, everything is okay and it is just like they got
a good slot in the first place (with the exception of a lock/unlock pair).

One minor problem is that with sufficient producers and a delayed fixer (who
holds the lock), we could overflow the prod_idx (around 3900 would overflow
into a new page on the 0th slot). To avoid this, the first producer to detect
the slot/index is no good will set an overflow flag. All future producers will
check this flag before attempting to get a slot, and if we're overflowing, they
will jump to the "grab lock" phase. This limits the window of vulnerability.
In theory, you could have 3900 producers at exactly the same time all fetch and
add, before any of the others have a chance to set the overflow. Not only does
this require 3900 cores, but they all have to be writing to the exact same UCQ.
If this happens, we have a warning that will go off and I'll fix it.

So the last part to deal with is getting a new page and linking it with the old
one. We simply atomic_swap on the spare_pg. If there is a spare already
mmap'd, we use it. If there isn't, we need to mmap a new page. Either way, we
tell the old page to follow to the new page. We set the index to a good value,
then unlock (and clear the overflow flag).

When we set the counter, we set it to 1, instead of 0, thereby reserving slot 0
for ourselves. This prevents a DoS from the program. The user could muck with
the prod_idx endlessly, in a way that seems benign. To prevent this, we make
sure that any time a producer grabs the hashlock, it gets a good slot.
Whenever we fail to get a slot we lock, and whenever we lock we get a good
slot.

Side note: another related DoS involves using CAS. You can't CAS with
userspace unless you're willing to fail. Otherwise, you could be DoS'd.

When we have a slot, we write in the ev_msg, and then toggle the 'ready' flag
in the message container. This is in case the consumer is trying to
concurrently read our message while we are writing.

And finally, every time the kernel reads or writes to something, we need to
verify the address (which is user-supplied). Eventually, the kernel can handle
a page fault on a user address, but we can't tolerate KVAs in UCQs. Note that
old versions allowed UCQs/BCQs to be at KVAs (procdata), but that behavior will
probably go away soon (error prone, we now handle user pointers anyway, and we
usually load 'current' when dealing with a proc).


2.2 Consumer
------------------------------------------------------------------
Consumers CAS (compare-and-swap) to claim a slot. We need to use CAS, since
we don't want to advance unless the queue is not empty. If they detect the
slot is bad, they fight among themselves for a lock, which is embedded in the
UCQ, and the winner sets the counter to track the next page, frees extra pages,
etc.

Emptiness is defined as prod_idx == cons_idx. Both the producer's and the
consumer's *next* slot is the same, so there are no items to be consumed. If
the prod_idx is bad, the consumer needs to wait until there is a good next page
(the kernel is in the process of finding and filling a good slot). If the
cons_idx is bad, it means we need to go to the next page (lock first).

When locking, the consumer does the same trick as the producer: try to get a
slot again, in case someone else got the lock before you and fixed things up.

So once we have the next page (it has been posted by the kernel), we can set up
the cons_idx so consumers can grab slots. We don't need to reserve a slot for
ourselves (no DoS risk).

After setting up the counter, our next job is to *free* the old page of
consumed items. The trick here is that there may be consumers still using it.
We statically know how many items there are on the page, so we have all
consumers increment another counter when they are done with their slot. When
that counter hits the max, we know everyone is done with the page and it can be
*freed*. That counter is like an inverted reference count. I don't care when
it is 0, I care when it is the max. Alternatively, we could have initialized
it to MAX and decremented, but this way felt more natural. When the page is
done, we don't actually free it. We'll atomic_swap it with the spare_pg,
making it the spare. If there already was a spare, then we have too many pages
and need to munmap the old spare.

So we finally get our good slot, and we spin until the kernel has loaded it
with a message. Then we just copy it out, increment the 'number consumed'
counter, and we're done.

3. Misc Other Notes:
====================
I think most of these things were discussed inline. There are a few random
things worth mentioning, some of which are obvious in hindsight:
- The kernel can't CAS and also not fail, otherwise it can be DoSed.

- Without CAS, the kernel can't atomically change the page and the index unless
they are the same long / modifiable in the same atomic statement. This is
why the page and the counter/slot number are put together in prod_idx.

- Kernel writers/producers need to stop/delay while another producer fixes
things up. The only acceptable place to delay/spin is on the proc's
hashlock. Any other scheme will probably fail, even if all you want to do is
swap the spare page and reset the counter. It doesn't work for the same
reason we have the current page and the index in one memory location.

- The prod_idx and cons_idx are both protected by the lock and atomically
incremented. The lock protects who resets them, with the understanding that
it should only be done when everyone agrees the counter is bad (too large), but
the normal atomics happen lock-free.

- Userspace should mmap a huge chunk of memory and then pass in page addresses
to the ucq_init() function. I made the init function so that this would be
really easy.