blob: 9e3fb96930056532870b04f80ef93b5110c632dc [file] [log] [blame]
ucq.txt
Barret Rhoden
1. Overview
====================
1.1 What are they?
------------------------------------------------------------------
UCQs are a tool to send messages, with payloads, from the kernel to a process
through shared memory. The depth of the queue is unlimited, barring running
out of memory. They should be used in closed loop or low rate scenarios that
require a payload (or that building a system that can handle missing a message
is unwieldy).
The old BCQs were designed for a shared memory segment, such as procdata.
Since the kernel is sending the messages, we can use a more customized
solution. The problems in this area are running out of memory, having the
kernel use user-provided-pointers, and synchronizing within the kernel (like a
spinlock for a TAILQ).
The basic plan for these "unbounded concurrent queues" (UCQ) is to have linked
mmap'd pages of arrays of event messages (the BCQ is a circular array of event
messages, roughly). As the kernel over-produces, it mmaps more pages and links
them together (via a header at the beginning of the page). Userspace munmaps
when it is done with a page. To avoid excessive mmap/munmap, we double-buffer
with two pages, one current and one spare. We only need to mmap new ones when
the kernel gets far ahead of the user.
- When we run out of room, the kernel will implicitly mmap another page,
solving the memory allocation issue. The worst thing userspace can do is
leak its memory, or cause an OOM condition, which we have to deal with anyway.
- Using user-pointers isn't conceptually a problem any more, so long as the
kernel reads it in and verifies the address is in userspace (and that it can
handle page faults). This wasn't the case a couple years ago when I made the
BCQs. We still are careful about pointers - we only use them when messing with
which page is current, etc, and not when atomically adding items.
- Swapping pages/buffers requires locking, but we can't put the lock in the UCQ
structure, since userspace could muck with it. Instead, we lock at the
process level. And instead of grabbing the process lock, we'll grab a hash
lock (hash table of locks, hashed on the pointer of the UCQ). This will happen
every ~170 events or so. Synchronization for normal ops (not buffer swaps) are
done with atomics.
Another option instead of allocating more memory would be to have the kernel
block kthreads until the queue empties. I really dislike that for a couple
reasons. It's easier to handle running out of memory than spawning too many
kthreads, esp in critical parts of the code (where we can't sleep on memory
allocation). More importantly, it's a real pain to block on a user's actions.
For instance, I'm not about to put a semaphore in user-writable memory.
1.2 Why do we have them?
------------------------------------------------------------------
Enqueuing messages in BCQs could fail, due to either a misbehaving process, an
unlucky circumstance, or most commonly because the queue was full. Event
queues, which used BCQs as a building block, would handle the failure as an
'overflow' of events, and signal the process with a bit. This means the
program needs to know how to handle overflow, which becomes painful.
A specific case was syscall completion events. When a syscall was done, the
kernel would send a message that a particular syscall was done. Userspace
needed to know exactly which one was done, under normal circumstances. With
BCQs, the 2LS needed to know how to handle overflow, which means it needs to
track every outstanding syscall so that it can poll to see which call
completed. To do this in a scalable manner required each vcore to have its own
TAILQ of oustanding uthreads (blocked on syscalls). The problem with this is
that blocked syscalls are now tied to vcores, and that vcore might not want to
yield even if it has no work to do (since it needs to receive its message).
Ultimately, the issue is that complicated systems could benefit from not
needing to handle overflow. For certain event queue usages, the unbounded
nature of UCQs make the system far simpler. When we build other systems in
userspace, such as using ev_qs for syscalls and uthread blocking, then we can
leverage the UCQs.
Note the main downfall of UCQs is that a process can die if it the kernel runs
out of mememory while trying to send it messages. If the kernel sends messages
faster than the process can handle them (for whatever reason, including being
swapped out), eventually we'll run out of memory. UCQs need to be used
carefully, such that the process isn't requesting an *unbounded* amount of
messages at one time.
The main benefit of UCQs in this situation is that they can handle spikes of
messages and they remove the need to handle overflow. Using BCQs, we'd need to
handle overflow even if it was unlikely, and the impact of this extended beyond
actually handling overflow. Check out the old overflow handling code (and
maybe some notes in the Documentation) for details about how we needed to do
work for every syscall, in the off chance we had to handle overflow.
2. How do they work?
====================
2.1 Producer (kernel)
------------------------------------------------------------------
Producers atomically fight for slots in prod_idx, which encode both the page
and the msg number within the page. If the slot is good, we just write our
message. If it isn't, things get interesting. We need to synchronize, so we
lock externally to the ucq. This is where the hash lock comes in. A bad slot
is one in which there is no page or the message slot is greater than the array
of msgs in the page.
Whoever grabs the lock first will be responsible for getting a new page and
resetting the counter (which tracks which slot is next). All future lockers
can tell that the work is done by examining the counter and trying to get a
slot. If they got a good one, everything is okay and it is just like they got
a good slot in the first place (with the exception of a lock/unlock pair).
One minor problem is that with sufficient producers and a delayed fixer (who
holds the lock), we could overflow the prod_idx (around 3900 would overflow
into a new page on the 0th slot). To avoid this, the first producer to detect
the slot/index is no good will set an overflow flag. All future producers will
check this flag before attempting to get a slot, and if we're overflowing, they
will jump to the "grab lock" phase. This limits the window of vulnerability.
In theory, you could have 3900 producers at exactly the same time all fetch and
add, before any of the others have a chance to set the overflow. Not only does
this require 3900 cores, but they all have to be writing to the exact same UCQ.
If this happens, we have a warning that will go off and I'll fix it.
So the last part to deal with is getting a new page and linking it with the old
one. We simply atomic_swap on the spare_pg. If there is a spare already
mmaped, we use it. If there isn't, we need to mmap a new page. Either way, we
tell the old page to follow to the new page. We set the index to a good value,
then unlock (and clear the overflow flag).
When we set the counter, we set it to 1, instead of 0, thereby reserving slot 0
for ourselves. This prevents a DoS from the program. The user could muck with
the prod_idx endlessly, in a way that seems benign. To prevent this, we make
sure that any time the process grabs the hashlock that it gets a good slot.
Whenever we fail to get a slot we lock, and whenever we lock we get a good
slot.
Side note: another related DoS involves using CAS. You can't CAS with
userspace unless you're willing to fail. O/W, you could be DoS'd.
When we have a slot, we write in the ev_msg, and then toggle the 'ready' flag
in the message container. This is in case the consumer is trying to
concurrently read our message while we are writing.
And finally, every time the kernel reads or writes to something, we need to
verify the address (which is user-supplied). Eventually, the kernel can handle
a page fault on a user address, but we can't tolerate KVAs in UCQs. Note that
old versions allowed UCQs/BCQs to be at KVAs (procdata), but that behavior will
probably go away soon (error prone, we now handle user pointers anyway, and we
usually load 'current' when dealing with a proc).
2.2 Consumer
------------------------------------------------------------------
Consumers CAS (compare-and-swap) on claiming a slot. We need to use CAS, since
we don't want to advance unless the queue is not empty. If they detect the
slot is bad, they fight among themselves for a lock, which is embedded in the
UCQ, and the winner sets the counter to track the next page, frees extra pages,
etc.
Emptiness is defined as the prod_idx == cons_idx. Both the producer's and the
consumer's *next* slot is the same, so there is no items to be consumed. If
the prod_idx is bad, the consumer needs to wait til there is a good next page
(the kernel is in the process of finding and filling a good slot). If the
cons_idx is bad, it means we need to go to the next page (lock first).
When locking, the consumer does the same trick as the producer: try to get a
slot again, in case someone else got the lock before you and fixed things up.
So once we have the next page (it has been posted by the kernel), we can set up
the cons_idx so consumers can grab slots. We don't need to reserve a slot for
ourselves (no DoS risk).
After setting up the counter, our next job is to *free* the old page of
consumed items. The trick here is that there may be consumers still using it.
We statically know how many items there are on the page, so we have all
consumers increment another counter when they are done with the slot. When
that counter is the max, we know everyone is done with the page and it can be
*freed*. That counter is like an inverted reference count. I don't care when
it is 0, I care when it is the max. Alternatively, we could have initialized
it to MAX and decremented, but this way felt more natural. When the page is
done, we don't actually free it. We'll atomic_swap it with the spare_pg,
making it the spare. If there already was a spare, then we have too many pages
and need to munmap the old spare.
So we finally get our good slot, and we spin til the kernel has loaded it with
a message. Then we just copy it out, increment the 'number consumed' counter,
and we're done.
3. Misc Other Notes:
====================
I think most of these things were discussed inline. There are a few random
things worth mentioning, some of which are obvious in hindsight:
- The kernel can't CAS and also not fail, o/w it can be DoSed.
- Without CAS, the kernel can't atomically change the page and the index unless
they are the same long / modifiable in the same atomic statement. This is
why the page and the counter/slot number are put together in prod_idx.
- Kernel writers/producers need to stop/delay while another producer fixes
things up. The only acceptable place to delay/spin is on the proc's
hashlock. Any other scheme will probably fail, even if all you want to do is
swap the spare page and reset the counter.It doesn't work for the same reason
we have the current page and the index in one memory location.
- The prod_idx and cons_idx are both protected by the lock and atomically
incremented. The lock protects who resets them, with the understanding that
it should only be done when everyone agrees the counter is bad (too large), but
the normal atomics happen lock-free.
- Userspace should mmap a huge chunk of memory and then pass in page addresses
to the ucq_init() function. I made the init function so that this would be
really easy.