| async_events.txt | 
 | Barret Rhoden | 
 |  | 
 | 1. Overview | 
 | 2. Async Syscalls and I/O | 
 | 3. Event Delivery / Notification | 
 | 4. Single-core Process (SCP) Events | 
 | 5. Misc Things That Aren't Sorted Completely: | 
 |  | 
 | 1. Overview | 
 | ==================== | 
 | 1.1 Event Handling / Notifications / Async IO Issues: | 
 | ------------------------------------------------------------------ | 
 | Basically, syscalls use the ROS event delivery mechanisms, redefined and | 
 | described below.  Syscalls use event delivery just like any other subsystem | 
 | that wants to deliver messages to a process.  The only other example we have | 
 | right now is "kernel notifications": the one-sided, kernel-initiated | 
 | messages that the kernel sends to a process. | 
 |  | 
 | Overall, there are several analogies between how vcores work and how the OS | 
 | handles interrupts.  This is a result of trying to make vcores run like | 
 | virtual multiprocessors, in control of their resources and aware of the lower | 
 | levels of the system.  This analogy has guided much of how the vcore layer | 
 | works.  Whenever we have issues with 2LSs, realize that the amount of | 
 | control they want means using solutions the OS must use too. | 
 |  | 
 | Note that there is some pointer chasing going on, though we try to keep it to | 
 | a minimum.  Any time the kernel chases a pointer, it needs to make sure it is | 
 | in the R/W section of userspace, though it doesn't need to check if the page | 
 | is present.  There's more info in the Page Fault sections of the | 
 | documentation.  (Briefly, if the kernel PFs on a user address, it will either | 
 | block and handle the PF, or if the address was unmapped, it will kill the | 
 | process). | 
 |  | 
 | 1.2 Some Definitions: | 
 | --------------------------------------- | 
 | ev_q, event_queue, event_q: all terms used interchangeably with each other. | 
 | They are the endpoint for communicating messages to a process, encapsulating | 
 | the method of delivery (such as IPI or not) with where to save the message. | 
 |  | 
 | Vcore context: the execution context of the virtual core on the "trampoline" | 
 | stack.  All executions start from the top of this stack, and no stack state is | 
 | saved between vcore_entry() calls.  All executions on here are non-blocking, | 
 | notifications (IPIs) are disabled, and there is a specific TLS loaded.  Vcore | 
 | context is used for running the second level scheduler (2LS), swapping between | 
 | threads, and handling notifications.  It is analogous to "interrupt context" | 
 | in the OS.  Any functions called from here should be brief.  Any memory | 
 | touched must be pinned.  In Lithe terms, vcore context might be called the | 
 | Hart / hard thread.  People often wonder if they can run out of vcore context | 
 | directly.  Technically, you can, but you lose the ability to take any fault | 
 | (page fault) or to get IPIs for notification.  In essence, you lose control, | 
 | analogous to running an application in the kernel with preemption/interrupts | 
 | disabled.  See the process documentation for more info. | 
 |  | 
 | 2LS: the second level scheduler/framework.  This code executes in vcore | 
 | context, and is Lithe / plugs in to Lithe (eventually).  Often used | 
 | interchangeably with "vcore context", usually when I want to emphasize the | 
 | scheduling nature of the code. | 
 |  | 
 | VCPD: "virtual core preemption data".  In procdata, there is an array of | 
 | struct preempt_data, one per vcore.  This is the default location to look for | 
 | all things related to the management of a vcore, such as its event_mbox (queue | 
 | of incoming messages/notifications/events).  Both the kernel and the vcore | 
 | code know to look here for a variety of things. | 
 |  | 
 | Vcore-business: This is a term I use for a class of messages where the receiver | 
 | is the actual vcore, and not just using the vcore as a place to receive the | 
 | message.  Examples of vcore-business are INDIR events, preempt_pending events, | 
 | scheduling events (self-ipis by the 2LS from one vcore to another), and things | 
 | like that.  There are two types: public and private.  Private will only be | 
 | handled by that vcore.  Public might be handled by another vcore. | 
 |  | 
 | Notif_table: This is a list of event_q*s that correspond to certain | 
 | unexpected/"one-sided" events the kernel sends to the process.  It is similar | 
 | to an IRQ table in the kernel.  Each event_q tells the kernel how the process | 
 | wants to be told about the specific event type. | 
 |  | 
 | Notifications: used to be a generic event, but now used in terms of the verb | 
 | 'notify' (do_notify()).  In older docs, passive notification is just writing a | 
 | message somewhere.  Active notification is an IPI delivered to a vcore.  I use | 
 | that term interchangeably with an IPI, and usually you can tell by context | 
 | that I'm talking about an IPI going to a process (and not just the kernel). | 
 | The details of it make it more complicated than just an IPI, but it's | 
 | analogous.  I've started referring to the notification as the IPI, and "passive | 
 | notification" as just events, though older documentation uses both meanings. | 
 |  | 
 | BCQ: "bounded concurrent queue".  It is a fixed size array of messages | 
 | (structs of notification events, or whatever).  It is non-blocking, supporting | 
 | multiple producers and consumers, where the producers do not trust the | 
 | consumers.  It is the primary mechanism for the kernel delivering message | 
 | payloads into a process's address space.  Note that producers don't trust each | 
 | other either (in the event of weirdness, the producers give up and say the | 
 | buffer is full).  This means that a process can produce for one of its ev_qs | 
 | (which is what it needs to do to send a message to itself). | 
 |  | 
 | UCQ: "unbounded concurrent queue".  This is a data structure allowing the kernel | 
 | to produce an unbounded number of messages for the process to consume.  The main | 
 | limitation to the number of messages is RAM.  Check out its documentation. | 
 |  | 
 | 2. Async Syscalls and I/O | 
 | ==================== | 
 | 2.1 Basics | 
 | ---------------------------------------------- | 
 | The syscall struct is the contract for work with the kernel, including async | 
 | I/O.  Lots of current OS async packages use epoll or other polling systems. | 
 | Note the distinction between Polling and Async I/O.  Polling is about finding | 
 | out if a call will block.  It is primarily used for sockets and pipes.  It | 
 | does relatively nothing for disk I/O, which requires a separate async I/O | 
 | system.  By having all syscalls be async, we can make polling a bit easier and | 
 | more unified with the generic event code that we use for all syscalls. | 
 |  | 
 | For instance, we can have a sys_poll syscall, which is async just like any | 
 | other syscall.  The call can be "one shot / non-blocking", like current | 
 | systems' polling code, or it can also notify on change (not requiring future | 
 | polls) via the event_q mechanisms.  If you don't want to be IPIed, you can | 
 | "poll" the syscall struct - not requiring another kernel crossing/syscall. | 
 |  | 
 | Note that we do not tie syscalls and polling to FDs.  We do events on | 
 | syscalls, which can be used to check FDs.  I think a bunch of polling cases | 
 | will not be needed once we have async syscalls, but for those that remain, | 
 | we'll have sys_poll() (or whatever). | 
 |  | 
 | To receive an event on a syscall completion or status change, just fill in the | 
 | event_q pointer.  If it is 0, the kernel will assume you poll the actual | 
 | syscall struct. | 
 |  | 
 | struct syscall { | 
 | 	current stuff 		/* arguments, retvals */ | 
 | 	struct ev_queue * 	/* struct used for messaging, including IPIs*/ | 
 | 	void * 			/* used by 2LS, usually a struct u_thread * */ | 
 | } | 
 |  | 
 | One issue with async syscalls is that there can be too many outstanding IOs | 
 | (normally sync calls provide feedback / don't allow you to over-request). | 
 | Eventually, processes can exhaust kernel memory (the kthreads, specifically). | 
 | We need a way to limit the kthreads per proc, etc.  Shouldn't be a big deal. | 
 |  | 
 | Normally, we talk about changing the flag in a syscall to SC_DONE.  Async | 
 | syscalls can be SC_PROGRESS (new stuff happened on it), which can trigger a | 
 | notification event.  Some calls, like AIO or bulk accept, exist for a while | 
 | and slowly get filled in / completed.  In the future, we'll also want a way to | 
 | abort the in-progress syscalls (possibly any syscall!). | 
 |  | 
 | 2.2 Uthreads Blocking on Syscalls | 
 | ---------------------------------------------- | 
 | Many threading libraries will want some notion of a synchronous, blocking | 
 | thread.  These threads use regular I/O calls, which are async under the hood, | 
 | but don't want to bother with callbacks or other details of async I/O.  In | 
 | this section, I'll talk a bit about how this works, esp regarding | 
 | uthreads/pthreads. | 
 |  | 
 | 'Blocking' refers to user threads, and has nothing to do with an actual | 
 | process blocking/waiting on some kernel event.  The kernel does not know | 
 | anything about what goes on here.  While a bit confusing, this allows | 
 | applications to do whatever they want on top of an async interface, and is a | 
 | consequence of decoupling cores from user-threads from kthreads. | 
 |  | 
 | 2.2.1 Basics of Uthread Blocking | 
 | --------------- | 
 | When a thread calls a glibc function that makes a system call, if the syscall | 
 | is not yet complete when the kernel returns to userspace, glibc will check for | 
 | the existence of a second level scheduler and attempt to use it to yield its | 
 | uthread.  If there is no 2LS, the code just spins for now.  Eventually, it | 
 | will try to suspend/yield the process for a while (til the call is done), aka, | 
 | block in the kernel. | 
 |  | 
 | If there is a 2LS, the current thread will yield, and call out to the 2LS's | 
 | blockon_sysc() method, which needs a way to stop the thread and be able to | 
 | restart it when the syscall completes.  Specifically, the pthread 2LS registers | 
 | the syscall to respond to an event (described in detail elsewhere in this doc). | 
 | When the event comes in, meaning the syscall is complete, the thread is put on | 
 | the runnable list. | 
 |  | 
 | Details: | 
 | - A pointer to the struct pthread is stored in the syscall's void*.  When the | 
 |   syscall is done, we normally get a message from the kernel, and the payload | 
 |   tells us the syscall is done, which tells us which thread to unblock.  | 
 | - The pthread code also always asks for an IPI and event message for every | 
 |   syscall that completes.  This is far from ideal.  Still, the basics are the | 
 |   same for any threading library.  Once you know a thread is done, you need to | 
 |   do something about it. | 
 | - The pthread code does syscall blocking and event notification on a per-core | 
 |   basis.  Using the default (VCPD) ev_mbox for this is a bad idea (which we did | 
 |   at some point). | 
 | - There's a race between the 2LS trying to sign up for events and the kernel | 
 |   finishing the event.  We handle this in uthread code, so use the helper to | 
 |   register_evq(), which does the right thing (atomics, careful ordering | 
 |   with writes, etc). | 
 |  | 
 | 2.2.2 Recovering from Event Overflow | 
 | --------------- | 
 | Event overflow recovery is unnecessary, since syscall ev_qs use UCQs now.  This | 
 | section is kept around for some useful tidbits, such as details about | 
 | deregistering ev_qs for a syscall: | 
 |  | 
 | --------------------------- | 
 | The pthread code expects to receive an event somehow to unblock a thread | 
 | once its syscall is done.  One limitation to our messaging systems is that you | 
 | can't send an infinite amount of event messages.  (By messages, I mean a chunk | 
 | of memory with a payload, in this case consisting of a struct syscall *). | 
 | Event delivery degrades to setting a bit in the case of the message queue | 
 | being full (more details on that later). | 
 |  | 
 | The pthread code (and any similar 2LS) needs to handle waking up threads when | 
 | the event message was lost and all we know is that some syscall - one that was | 
 | meant to have a message sent to a particular event queue (per-core in the case | 
 | of the pthread stuff (actually the VCPD for now)) - has completed.  The basic | 
 | idea is to poll all outstanding system calls and unblock whoever is done. | 
 |  | 
 | The key problem is due to a race: for a given syscall, we don't know if we're | 
 | going to get a message for it or not.  There could be a completion | 
 | message in the queue for the syscall while we are going through the list of | 
 | blocked threads.  If we assume we already got the message (or it was lost in | 
 | the overflow), but didn't really, then if we finish an SC and free its memory | 
 | (free or return up the stack), we could later get a message for it, and all | 
 | sorts of things would go wrong (like trying to unblock a pointer that is | 
 | gibberish). | 
 |  | 
 | Here's what we do: | 
 | 1) Set a "handling overflow" flag so we don't recurse. | 
 | 2) Turn off event delivery for all syscalls on our list. | 
 | 3) Handle any event messages.  This is how we make a distinction between | 
 | finished syscalls that had a message sent and those that didn't.  We're doing | 
 | the message-sent ones here. | 
 | 4) For any left on the list, check to see if they are done.  We actually do | 
 | this by attempting to turn on event delivery for them.  Turning on event | 
 | delivery can fail if the call is already done.  So if it fails, they are done | 
 | and we unblock them (similar to how we block the threads in the first place). | 
 | If it doesn't fail, they are now ready to receive messages.  This can be | 
 | tweaked a bit. | 
 | 5) Unset the overflow-handling flag. | 
 |  | 
 | One thing to be careful of: when you turn off event delivery, you need to | 
 | be sure the kernel isn't in the process of sending an event.  This is why we | 
 | have the SC_K_LOCK syscall flag.  Uthread code will not consider deregistration | 
 | complete while that flag is set, since the kernel is still mucking with the | 
 | syscall (and sending an event).  Once the flag is clear, the event has been | 
 | delivered (the ev_msg is in the ev_mbox), and our assumptions remain true. | 
 |  | 
 | There are a couple implications of this style.  If you have a shared event | 
 | queue (with other event sources), those events can get mixed in with the | 
 | recovery.  Don't leave the vcore context due to other events.  This'll | 
 | probably need work.  The other thing is that completed syscalls can get | 
 | handled in a different order than they were signaled.  Shouldn't be a big | 
 | deal. | 
 |  | 
 | Note on the overflow handling flag and unsetting it.  There should not be any | 
 | races with this.  The flag prevented us from handling overflows on the event | 
 | queue.  Other than when we checked for events that had been successfully sent, | 
 | we didn't try to handle events.  We can unset the flag, and at that point we | 
 | can start handling missed events.  If there was an overflow after we last | 
 | checked the list, but before we cleared the overflow-handling flag, we'll | 
 | still catch it since we haven't tried handling events in between checking the | 
 | list and clearing the flag.  That flag doesn't even matter until we want to | 
 | handle_events, so we aren't missing anything.  The next handle_events() will | 
 | deal with everything from scratch. | 
 |  | 
 | For blocking threads that block concurrently with the overflow handling: in | 
 | the pthread case, this can't happen since everything is per-vcore.  If you do | 
 | have process-wide thread blocking/syscall management, we can add new ones, but | 
 | they must have event delivery turned off when they are added to the list.  And | 
 | you'll need to lock the list, etc.  This should work in part due to new | 
 | syscalls being added to the end of the list, and the overflow-handler | 
 | proceeding linearly through the list. | 
 |  | 
 | Also note that we shouldn't handle the event for unblocking a syscall on a | 
 | different core than the one it was submitted to.  This could result in | 
 | concurrent modifications to the original core's TAILQ (bad).  This restriction | 
 | is dependent on how a 2LS does its thread handling/blocking. | 
 |  | 
 | Eventually, we'll want a way to detect and handle excessive overflow, since | 
 | it's probably quite expensive.  Perhaps turn it off and periodically poll the | 
 | syscalls for completion (but don't bother turning on the ev_q). | 
 | --------------------------- | 
 |  | 
 | 3. Event Delivery / Notification | 
 | ==================== | 
 | 3.1 Basics | 
 | ---------------------------------------------- | 
 | The mbox (mailbox) is where the actual messages go. | 
 |  | 
 | struct ev_mbox { | 
 | 	bcq of notif_events 	/* bounded buffer, multi-consumer/producer */ | 
 | 	msg_bitmap | 
 | } | 
 | struct ev_queue {		/* aka, event_q, ev_q, etc. */ | 
 | 	struct ev_mbox * | 
 | 	void handler(struct event_q *) | 
 | 	vcore_to_be_told | 
 | 	flags 			/* IPI_WANTED, RR, 2L-handle-it, etc */ | 
 | } | 
 | struct ev_queue_big { | 
 | 	struct ev_mbox *	/* pointing to the internal storage */ | 
 | 	vcore_to_be_told | 
 | 	flags 			/* IPI_WANTED, RR, 2L-handle-it, etc */ | 
 | 	struct ev_mbox { }	/* never access this directly */ | 
 | } | 
 |  | 
 | The purpose of the big one is to simply embed some storage.  Still, only | 
 | access the mbox via the pointer.  The big one can be cast to (and stored as) | 
 | the regular one, so long as you know to dealloc a big one (free() knows; custom | 
 | styles or slabs would need some help). | 
 |  | 
 | The ev_mbox says where to put the actual message, and the flags handle things | 
 | such as whether or not an IPI is wanted. | 
 |  | 
 | Using pointers for the ev_q like this allows multiple event queues to use the | 
 | same mbox.  For example, we could use the vcpd queue for both kernel-generated | 
 | events as well as async syscall responses.  The notification table is actually | 
 | a bunch of ev_qs, many of which could be pointing to the same vcore/vcpd-mbox, | 
 | albeit with different flags. | 
 |  | 
 | 3.2 Kernel Notification Using Event Queues | 
 | ---------------------------------------------- | 
 | The notif_tbl/notif_methods (kernel-generated 'one-sided' events) is just an | 
 | array of struct ev_queue*s.  Handling a notification is like any other time | 
 | when we want to send an event.  Follow a pointer, send a message, etc.  As | 
 | with all ev_qs, ev_mbox* points to where you want the message for the event, | 
 | which usually is the vcpd's mbox.  If the ev_q pointer is 0, then we know the | 
 | process doesn't want the event (equivalent to the older 'NOTIF_WANTED' flag). | 
 | Theoretically, we can send kernel notifs to user threads.  While it isn't | 
 | clear that anyone will ever want this, it is possible (barring other issues), | 
 | since they are just events. | 
 |  | 
 | Also note the flag EVENT_VCORE_APPRO.  Processes should set this for certain | 
 | types of events where they want the kernel to send the event/IPI to the | 
 | 'appropriate' vcore.  For example, when sending a message about a preemption | 
 | coming in, it makes sense for the kernel to send it to the vcore that is going | 
 | to get preempted, but the application could choose to ignore the notification. | 
 | When this flag is set, the kernel will also use the vcore's ev_mbox, ignoring | 
 | the process's choice.  We can change this later, but it doesn't really make | 
 | sense for a process to pick an mbox and also say VCORE_APPRO. | 
 |  | 
 | There are also interfaces in the kernel to put a message in an ev_mbox | 
 | regardless of the process's wishes (post_vcore_event()), and to send an IPI | 
 | at any time (proc_notify()). | 
 |  | 
 | 3.3 IPIs, Indirection Events, and Fallback (Spamming Indirs) | 
 | ---------------------------------------------- | 
 | An ev_q can ask for an IPI, for an indirection event, and for an indirection | 
 | event to be spammed in case a vcore is offline (sometimes called the 'fallback' | 
 | option), or any combination of these.  Note that these have little to do with | 
 | the actual message being sent.  The actual message is dropped in the ev_mbox | 
 | pointed to by the ev_q. | 
 |  | 
 | The main use for all of this is for syscalls.  If you want to receive an event | 
 | when a syscall completes or has a change in status, simply allocate an event_q, | 
 | and point the syscall at it.  syscall: ev_q* -> "vcore for IPI, syscall message | 
 | in the ev_q mbox", etc.  You can also point it to an existing ev_q.  Pthread | 
 | code has examples of two ways to do this.  Both have per vcore ev_qs, requesting | 
 | IPIs, INDIRS, and SPAM_INDIR.  One way is to have an ev_mbox per vcore, and | 
 | another is to have a global ev_mbox that all ev_qs point to.  As a side note, if | 
 | you do the latter, you don't need to worry about a vcore's ev_q if it gets | 
 | preempted: just check the global ev_mbox (which is done by checking your own | 
 | vcore's syscall ev_q). | 
 |  | 
 | 3.3.1: IPIs and INDIRs | 
 | --------------- | 
 | An EVENT_IPI simply means we'll send an IPI to the given vcore.  Nothing else. | 
 | This will usually be paired with an Indirection event (EVENT_INDIR).  An INDIR | 
 | is a message of type EV_EVENT with an ev_q* payload.  It means "check this | 
 | ev_q".  Most ev_qs that ask for an IPI will also want an INDIR so that the vcore | 
 | knows why it was IPIed.  You don't have to do this: for instance, your 2LS might | 
 | poll its own ev_q, so you won't need the indirection event. | 
 |  | 
 | Additionally, note that IPIs and INDIRs can be spurious.  It's not a big deal to | 
 | receive an IPI and have nothing to do, or to be told to check an empty ev_q. | 
 | All of the event handling code can deal with this. | 
 |  | 
 | INDIR events are sent to the VCPD public mbox, which means they will get handled | 
 | if the vcore gets preempted.  Any other messages sent here will also get handled | 
 | during a preemption.  However, the only type of messages you should use this for | 
 | are ones that can handle spurious messages.  The completion of a syscall is an | 
 | example of a message that cannot be spurious.  Since INDIRs can be spurious, we | 
 | can use the public mbox.  (Side note: the kernel may spam INDIRs in attempting | 
 | to make sure you get the message on a vcore that didn't yield.) | 
 |  | 
 | Never use a VCPD mbox (public or private) for messages you might want to receive | 
 | if that vcore is offline.  If you want to be sure to get a message, create your | 
 | own ev_q and set flags for INDIR, SPAM_INDIR, and IPI.  There's no guarantee a | 
 | *specific* message will get looked at.  In cases where it won't, the kernel will | 
 | send that message to another vcore.  For example, if the kernel posts an INDIR | 
 | to a VCPD mbox (the public one btw) and it loses a race with the vcore yielding, | 
 | the vcore might never see that message.  However, the kernel knows it lost the | 
 | race, and will find another vcore to send it to. | 
 |  | 
 | 3.3.2: Spamming Indirs / Fallback | 
 | --------------- | 
 | Both IPI and INDIR need an actual vcore.  If that vcore is unavailable and if | 
 | EVENT_SPAM_INDIR is set, the kernel will pick another vcore and send the | 
 | messages there.  This allows an ev_q to be set up to handle work when the vcore | 
 | is online, while allowing the program to handle events when that core yields, | 
 | without having to reset all of its ev_qs to point to "known" available vcores | 
 | (and avoiding those races).  Note 'online' is synonymous with 'mapped' when | 
 | talking about vcores.  Technically, a vcore isn't always online, only destined | 
 | to be online, once it is mapped to a pcore (kmsg on the way, etc).  It's | 
 | easiest to think of it as online for the sake of this discussion. | 
 |  | 
 | One question is whether or not 2LSs need a SPAM_INDIR flag for their ev_qs. | 
 | The main use for SPAM_INDIR is so that vcores can yield.  (Note that spamming | 
 | won't save you from *missing* INDIR messages in the event of a preemption; you | 
 | can always lose that race by taking too long to process the messages.)  An | 
 | alternative would be for vcores to pick another vcore and change all of its | 
 | ev_qs to that vcore.  There are a couple problems with this.  One is that it'll | 
 | be a pain to get those ev_qs back when the vcore comes back online (if ever). | 
 | Another issue is that other vcores will build up a list of ev_qs that they | 
 | aren't aware of, which will be hard to deal with when *they* yield.  SPAM_INDIR | 
 | avoids all of those problems. | 
 |  | 
 | An important aspect of spamming indirs is that it works with yielded vcores, | 
 | not preempted vcores.  It could be that there are no cores that are online, but | 
 | there should always be at least one core that *will* be online in the future, a | 
 | core that the process didn't want to lose and will deal with in the future.  If | 
 | not for this distinction, SPAM_INDIR could fail.  An older idea would be to have | 
 | fallback send the msg to the desired vcore if there were no others.  This would | 
 | not work if the vcore yielded and then the entire process was preempted or | 
 | otherwise not running.  Another way to put this is that we need a field to | 
 | determine whether a vcore is offline temporarily or permanently. | 
 |  | 
 | This is why we have the VCPD flag 'VC_CAN_RCV_MSG'.  It tells the kernel's event | 
 | delivery code that the vcore will check the messages: it is an acceptable | 
 | destination for a spammed indir.  There are two reasons to put this in VCPD: | 
 | 1) Userspace can remotely turn off a vcore's msg reception.  This is necessary | 
 | for handling preemption of a vcore that was in uthread context, so that we can | 
 | remotely 'yield' the core without having to sys_change_vcore() (which I discuss | 
 | below, and is meant to 'unstick' a vcore). | 
 | 2) Yield is simplified.  The kernel no longer races with itself nor has to worry | 
 | about turning off that flag - userspace can do it when it wants to yield (turn | 
 | off the flag, check messages, then yield).  This is less of a big deal now that | 
 | the kernel races with vcore membership in the online_vcs list. | 
 |  | 
 | Two aspects of the code make this work nicely.  The VC_CAN_RCV_MSG flag greatly | 
 | simplifies the kernel's job.  There are a lot of weird races we'd have to deal | 
 | with, such as process state (RUNNING_M), whether a mass preempt is going on, or | 
 | just one core, or a bunch of cores, mass yields, etc.  A flag that does one | 
 | thing well helps a lot - esp since preemption is not the same as yielding.  The | 
 | other useful thing is being able to handle spurious events.  Vcore code can | 
 | handle extra IPIs and INDIRs to non-VCPD ev_qs.  Any vcore can handle an ev_q | 
 | that is "non-VCPD business". | 
 |  | 
 | Worth mentioning is the difference between 'notif_pending' and VC_CAN_RCV_MSG. | 
 | VC_CAN_RCV_MSG is the process saying it will check for messages. | 
 | 'notif_pending' is when the kernel says it *has* sent a message. | 
 | 'notif_pending' is also used by the kernel in proc_yield() and the 2LS in | 
 | pop_user_ctx() to make sure the sent message is not missed. | 
 |  | 
 | Also, in case this comes up, there's a slight race on changing the mbox* and the | 
 | vcore number within the event_q.  The message could have gone to the wrong (old) | 
 | vcore, but not the IPI.  Not a big deal - IPIs can be spurious, and the other | 
 | vcore will eventually get it.  The real way around this is to create a new ev_q | 
 | change the pointer (thus atomically changing the entire ev_q's contents), though | 
 | this can be a bit tricky if you have multiple places pointing to the same ev_q | 
 | (can't change them all at once). | 
 |  | 
 | 3.3.3: Fallback and Preemption | 
 | --------------- | 
 | SPAM_INDIR doesn't protect you from preemptions.  A vcore can be preempted and | 
 | have INDIRs in its VCPD. | 
 |  | 
 | It is tempting to just use sys_change_vcore(), which will change the calling | 
 | vcore to the new one.  This should only be used to "unstick" a vcore.  A vcore | 
 | is stuck when it was preempted while it had notifications disabled.  This is | 
 | usually when it is in vcore context, but also in any lock-holding code for locks | 
 | shared with vcore context (the userspace equivalent of irqsave locks).  With | 
 | this syscall, you could change to the offline vcore and process its INDIRs. | 
 |  | 
 | The problem with that plan is the calling core (that is trying to save the | 
 | other) may have extra messages, and that sys_change_vcore does not return.  We | 
 | need a way to deal with our other messages.  We're back to the same problem we | 
 | had before, just with different vcores.  The only thing we really accomplished | 
 | is that we unstuck the other vcore.  We could tell the restarted vcore (via an | 
 | event) to switch back to us, but by the time it does that, it may have other | 
 | events that got lost.  So we're back to polling the ev_qs that it might have | 
 | received INDIRs about.  Note that we still want to send an event with | 
 | sys_change_vcore().  We want the new vcore to know the old vcore was put | 
 | offline: a preemption (albeit one that it chose to do, and one that isn't stuck | 
 | in vcore context). | 
 |  | 
 | An older way to deal with this was to force the 2LS to handle it.  The 2LS | 
 | would check the ev_mboxes of all ev_qs that could send INDIRs to the | 
 | offline vcore.  There could be INDIRS in the VCPD that are just lying there. | 
 | The 2LS knows which ev_qs these are (such as for completed syscalls), and for | 
 | many things, this will be a common ev_q (such as for 'vcore-x-was-preempted'). | 
 | However, this is a huge pain in the ass, since a preempted vcore could have the | 
 | spammed INDIR for an ev_q associated with another vcore.  To deal with this, | 
 | the 2LS would need to check *every* ev_q that requests INDIRs.  We don't do | 
 | this. | 
 |  | 
 | Instead, we simply have the remote core check the VCPD public mbox of the | 
 | preempted vcore.  INDIRs (and other vcore business that other vcores can handle) | 
 | will get sorted here. | 
 |  | 
 | 3.3.5: Lists to Find Vcores | 
 | --------------- | 
 | A process has three lists: online, bulk_preempt, and inactive.  These not only | 
 | are good for process management, but also for helping alert_vcore() find | 
 | potentially alertable vcores.  alert_vcore() and its associated helpers are | 
 | fairly complicated and heavily commented.  I've set things up so both the | 
 | online_vcs and the bulk_preempted_vcs lists can be handled the same way: post to | 
 | the first element, then see if it still has VC_CAN_RCV_MSG set.  If not, but | 
 | it is still the first on the list, then it hasn't proc_yield()ed yet, and it | 
 | will eventually | 
 | restart when it tries to yield.  And this all works without locking the | 
 | proc_lock.  There are a bunch more details and races avoided.  Check the code | 
 | out. | 
 |  | 
 | 3.3.6: Vcore Business and the VCPD Mboxes | 
 | --------------- | 
 | There are two types of VCPD mboxes: public and private.  Public ones will get | 
 | handled during preemption recovery.  Messages sent here need to be handle-able | 
 | by any vcore.  Private messages are for that specific vcore.  In the common | 
 | case, the public mbox will usually only get looked at by its vcore.  Only during | 
 | recovery and some corner cases will we deal with it remotely. | 
 |  | 
| Here are some guidelines: if your message is spammy, the handler can deal with | 
| spurious events, and it doesn't need to be on a specific vcore, then go with | 
| public.  Examples of public mbox events are ones that need to be spammed: | 
 | preemption recovery, INDIRs, etc.  Note that you won't need to worry about | 
 | these: uthread code and the kernel handle them.  But if you have something | 
 | similar, then that's where it would go.  You can also send non-spammy things, | 
 | but there's no guarantee they'll be looked at. | 
 |  | 
| Some messages should only be sent to the private mbox.  These include ones that | 
| make no sense for other vcores to handle.  Examples: 2LS IPIs/preemptions (like | 
| "change your scheduling policy, vcore 3"), preemption-pending notifs from the | 
| kernel, timer interrupts, etc. | 
 |  | 
 | An example of something that shouldn't be sent to either is syscall completions. | 
 | They can't be spammed, so you can't send them around like INDIRs.  And they need | 
 | to be dealt with.  Other than carefully-spammed public messages, there's no | 
 | guarantee of getting a message for certain scenarios (yields).  Instead, use an | 
 | ev_q with INDIR set. | 
 |  | 
 | Also note that a 2LS could set up a big ev_q with EVENT_IPI and not EVENT_INDIR, | 
 | and then poll for that in their vcore_entry().  This is equivalent to setting up | 
 | a small ev_q with EVENT_IPI and pointing it at the private mbox. | 
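
A sketch of those two equivalent configurations (flag values, struct layout, and mbox names here are made up; the real definitions live in the kernel headers):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag values and a pared-down event_queue. */
#define EVENT_IPI	0x1
#define EVENT_INDIR	0x2

struct event_mbox { int placeholder; };

struct event_queue {
	struct event_mbox *ev_mbox;
	int ev_flags;
	int ev_vcore;
};

static struct event_mbox big_mbox, vcpd_private_mbox;

/* Option A: a big ev_q with its own mbox, IPI but no INDIR; the 2LS polls
 * this mbox in vcore_entry(). */
static struct event_queue big_evq = {
	.ev_mbox = &big_mbox,
	.ev_flags = EVENT_IPI,
	.ev_vcore = 0,
};

/* Option B: a small ev_q pointed directly at vcore 0's private VCPD mbox.
 * Equivalent in effect: the vcore gets an IPI and finds the message in a
 * mbox it already checks. */
static struct event_queue small_evq = {
	.ev_mbox = &vcpd_private_mbox,
	.ev_flags = EVENT_IPI,
	.ev_vcore = 0,
};
```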
 |  | 
 | 3.4 Application-specific Event Handling | 
 | --------------------------------------- | 
 | So what happens when the vcore/2LS isn't handling an event queue, but has been | 
 | "told" about it?  This "telling" is in the form of an IPI.  The vcore was | 
 | prodded, but is not supposed to handle the event.  This is actually what | 
 | happens now in Linux when you send signals for AIO.  It's all about who (which | 
 | thread, in their world) is being interrupted to process the work in an | 
 | application specific way.  The app sets the handler, with the option to have a | 
 | thread spawned (instead of a sighandler), etc. | 
 |  | 
 | This is not exactly the same as the case above where the ev_mbox* pointed to | 
 | the vcore's default mbox.  That issue was just about avoiding extra messages | 
 | (and messages in weird orders).  A vcore won't handle an ev_q if the | 
| message/contents of the queue aren't meant for the vcore/2LS.  For example, a | 
| thread might want to run its own handler, perhaps because it performs its own | 
| asynchronous I/O (compared to relying on the 2LS to schedule synchronous | 
| blocking uthreads). | 
 |  | 
 | There are a couple ways to handle this.  Ultimately, the application is supposed | 
 | to handle the event.  If it asked for an IPI, it is because something ought to | 
 | be done, which really means running a handler.  We used to support the | 
 | application setting EVENT_THREAD in the ev_q's flags, and the 2LS would spawn a | 
 | thread to run the ev_q's handler.  Now we just have the application block a | 
 | uthread on the evq.  If an ev_handler is set, the vcore will execute the | 
 | handler itself.  Careful with this, since the only memory it touches must be | 
 | pinned, the function must not block (this is only true for the handlers called | 
 | directly out of vcore context), and it should return quickly. | 
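
A minimal sketch of such a handler, run straight out of vcore context, honoring those constraints.  The types are simplified stand-ins; note it takes the whole event_queue, not one message:

```c
#include <assert.h>
#include <stddef.h>

#define EVQ_CAP 16

struct event_msg {
	unsigned int ev_type;
};

struct event_queue {
	struct event_msg msgs[EVQ_CAP];
	int head, tail;
	int handled;		/* demo counter: events seen so far */
};

/* Vcore-context ev_q handler: drains whatever is queued.  No allocation, no
 * blocking, only pinned memory, and it returns quickly. */
static void app_evq_handler(struct event_queue *evq)
{
	while (evq->head != evq->tail) {
		evq->handled++;		/* "process" the message */
		evq->head = (evq->head + 1) % EVQ_CAP;
	}
}
```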
 |  | 
 | Note that in either case, vcore-written code (library code) does not look at | 
 | the contents of the notification event.  Also note the handler takes the whole | 
 | event_queue, and not a specific message.  It is more flexible, can handle | 
 | multiple specific events, and doesn't require the vcore code to dequeue the | 
 | event and either pass by value or allocate more memory. | 
 |  | 
 | These ev_q handlers are different than ev_handlers.  The former handles an | 
 | event_queue.  The latter is the 2LS's way to handle specific types of messages. | 
 | If an app wants to process specific messages, have them sent to an ev_q under | 
 | its control; don't mess with ev_handlers unless you're the 2LS (or example | 
 | code). | 
 |  | 
 | Continuing the analogy between vcores getting IPIs and the OS getting HW | 
 | interrupts, what goes on in vcore context is like what goes on in interrupt | 
 | context, and the threaded handler is like running a threaded interrupt handler | 
 | (in Linux).  In the ROS world, it is like having the interrupt handler kick | 
 | off a kernel message to defer the work out of interrupt context. | 
 |  | 
 | If neither of the application-specific handling flags are set, the vcore will | 
 | respond to the IPI by attempting to handle the event on its own (lookup table | 
 | based on the type of event (like "syscall complete")).  If you didn't want the | 
| vcore to handle it, then you shouldn't have asked for an IPI.  Those flags are | 
| the means by which the vcore can distinguish between its own event_qs and the | 
| application's.  It does not make sense otherwise to send the vcore an IPI and | 
| an event_q, but not give the code the info it needs to handle it. | 
 |  | 
 | In the future, we might have the ability to block a u_thread on an event_q, so | 
 | we'll have other EV_ flags to express this, and probably a void*.  This may | 
| end up being redundant, since uthreads will be able to block on syscalls (and | 
 | not necessarily IPIs sent to vcores). | 
 |  | 
 | As a side note, a vcore can turn off the IPI wanted flag at any time.  For | 
 | instance, when it spawns a thread to handle an ev_q, the vcore can turn off | 
 | IPI wanted on that event_q, and the thread handler can turn it back on when it | 
 | is done processing and wants to be re-IPId.  The reason for this is to avoid | 
 | taking future IPIs (once we leave vcore context, IPIs are enabled) to let us | 
 | know about an event for which a handler is already running. | 
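
A sketch of that mask/re-arm dance, with a made-up EVENT_IPI flag value:

```c
#include <assert.h>

#define EVENT_IPI 0x1	/* hypothetical flag value */

struct event_queue { int ev_flags; };

/* Vcore: a handler thread now owns this ev_q; stop taking IPIs for it. */
static void evq_handler_started(struct event_queue *evq)
{
	evq->ev_flags &= ~EVENT_IPI;
}

/* Handler thread: done processing; IPI us again for future events. */
static void evq_handler_done(struct event_queue *evq)
{
	evq->ev_flags |= EVENT_IPI;
}
```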
 |  | 
 | 3.5 Overflowed/Missed Messages in the VCPD  | 
 | --------------------------------------- | 
| This too is no longer necessary.  It's useful in that it shows what we don't | 
| have to put up with.  Missing messages would require potentially painful | 
| infrastructure to handle them: | 
 |  | 
 | ----------------------------- | 
 | All event_q's requesting IPIs ought to register with the 2LS.  This is for | 
 | recovering in case the vcpd's mbox overflowed, and the vcore knows it missed a | 
 | NE_EVENT type message.  At that point, it would have to check all of its | 
 | IPI-based queues.  To do so, it could check to see if the mbox has any | 
 | messages, though in all likelihood, we'll just act as if there was a message | 
 | on each of the queues (all such handlers should be able to handle spurious | 
| IPIs anyways).  This is analogous to how the OS's block drivers don't solely | 
 | rely on receiving an interrupt (they deal with it via timeouts).  Any user | 
 | code requiring an IPI must do this.  Any code that runs better due to getting | 
 | the IPI ought to do this. | 
 |  | 
 | We could imagine having a thread spawned to handle an ev_q, and the vcore | 
 | never has to touch the ev_q (which might make it easier for memory | 
 | allocation).  This isn't a great idea, but I'll still explain it.  In the | 
 | notif_ev message sent to the vcore, it has the event_q*.  We could also send a | 
 | flag with the same info as in the event_q's flags, and also send the handler. | 
 | The problem with this is that it isn't resilient to failure.  If there was a | 
| message overflow, it would have to check the event_q (which was registered | 
 | before) anyway, and could potentially page fault there.  Also the kernel would | 
 | have faulted on it (and read it in) back when it tried to read those values. | 
 | It's somewhat moot, since we're going to have an allocator that pins event_qs. | 
 | ----------------------------- | 
 |  | 
 | 3.6 Round-Robin or Other IPI-delivery styles | 
 | --------------------------------------- | 
 | In the same way that the IOAPIC can deliver interrupts to a group of cores, | 
 | round-robinning between them, so can we imagine processes wanting to | 
| distribute the IPI/active notification of events across its vcores.  This is | 
| only meaningful if the NOTIF_IPI_WANTED flag is set. | 
 |  | 
 | Eventually we'll support this, via a flag in the event_q.  When | 
| NE_ROUND_ROBIN, or whatever, is set, a couple of things will happen.  First, | 
 | vcore field will be used in a "delivery-specific" manner.  In the case of RR, | 
 | it will probably be the most recent destination.  Perhaps it will be a bitmask | 
 | of vcores available to receive.  More important is the event_mbox*.  If it is | 
 | set, then the event message will be sent there.  Whichever vcore gets selected | 
 | will receive an IPI, and its vcpd mbox will get a NE_EVENT message.  If the | 
 | event_mbox* is 0, then the actual message will get delivered to the vcore's | 
 | vcpd mbox (the default location). | 
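
A sketch of the "delivery-specific" use of the vcore field under round-robin: it records the most recent destination, and each send advances it.  The helper name and the field reuse are hypothetical:

```c
#include <assert.h>

struct event_queue {
	int ev_flags;
	int ev_vcore;	/* under RR: most recent destination */
};

/* Pick the next vcore in RR order and remember it for the next send. */
static int rr_pick_vcore(struct event_queue *evq, int num_vcores)
{
	evq->ev_vcore = (evq->ev_vcore + 1) % num_vcores;
	return evq->ev_vcore;
}
```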
 |  | 
 | 3.7 Event_q-less Notifications | 
 | --------------------------------------- | 
| Some events need to be delivered directly to the vcore, regardless of any | 
 | event_qs.  This happens currently when we bypass the notification table (e.g., | 
 | sys_self_notify(), preemptions, etc).  These notifs will just use the vcore's | 
 | default mbox.  In essence, the ev_q is being generated/sent with the call. | 
 | The implied/fake ev_q points to the vcpd's mbox, with the given vcore set, and | 
 | with IPI_WANTED set.  It is tempting to make those functions take a | 
 | dynamically generated ev_q, though more likely we'll just use the lower level | 
 | functions in the kernel, much like the Round Robin set will need to do.  No | 
 | need to force things to fit just for the sake of using a 'solution'.  We want | 
 | tools to make solutions, not packaged solutions. | 
 |  | 
 | 3.8 UTHREAD_DONT_MIGRATE | 
 | --------------------------------------- | 
 | DONT_MIGRATE exists to allow uthreads to disable notifications/IPIs and enter | 
 | vcore context.  It is needed since you need to read vcoreid to disable notifs, | 
 | but once you read it, you need to not move to another vcore.  Here are a few | 
 | rules/guidelines. | 
 |  | 
| We turn off the flag once notifs are disabled, and turn the flag back on | 
| before re-enabling them.  The thread won't get migrated in that instant since | 
| notifs are off.  If it were the other way around, we could miss a message | 
| (because we skipped an opportunity to be dropped into vcore context to read a | 
| message). | 
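
A sketch of that ordering, loosely modeled on parlib's uth_disable_notifs()/uth_enable_notifs(); the structs, globals, and cmb() here are simplified stand-ins:

```c
#include <assert.h>
#include <stdbool.h>

#define UTHREAD_DONT_MIGRATE 0x1

struct uthread { int flags; };
struct vcore_state { bool notifs_disabled; };

static struct uthread *current_uthread;
static struct vcore_state cur_vcore;

#define cmb() __asm__ __volatile__("" ::: "memory")

static void uth_disable_notifs(void)
{
	/* Pin first: we must not read the vcoreid and then migrate. */
	current_uthread->flags |= UTHREAD_DONT_MIGRATE;
	cmb();
	cur_vcore.notifs_disabled = true;  /* disable_notifs(vcore_id()) */
	/* Safe to clear now: with notifs off, we can't be migrated. */
	current_uthread->flags &= ~UTHREAD_DONT_MIGRATE;
}

static void uth_enable_notifs(void)
{
	/* Flag back on *before* enabling, so there's no window where we're
	 * migratable with notifs on but haven't checked messages. */
	current_uthread->flags |= UTHREAD_DONT_MIGRATE;
	cmb();
	cur_vcore.notifs_disabled = false;  /* enable_notifs(vcoreid) */
	current_uthread->flags &= ~UTHREAD_DONT_MIGRATE;
}
```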
 |  | 
 | Don't check messages/handle events when you have a DONT_MIGRATE uthread.  There | 
 | are issues with preemption recovery if you do.  In short, if two uthreads are | 
 | both DONT_MIGRATE with notifs enabled on two different vcores, and one vcore | 
 | gets preempted while the other gets an IPI telling it to recover the other one, | 
 | both could keep bouncing back and forth if they handle their preemption | 
 | *messages* without dealing with their own DONT_MIGRATEs first.  Note that the | 
 | preemption recovery code can handle having a DONT_MIGRATE thread on the vcore. | 
 | This is a special case, and it is very careful about how cur_uthread works. | 
 |  | 
 | All uses of DONT_MIGRATE must reenable notifs (and check messages) at some | 
 | point.  One such case is uthread_yield().  Another is mcs_unlock_notifsafe(). | 
 | Note that mcs_notif_safe locks have uthreads that can't migrate for a | 
 | potentially long time.  notifs are also disabled, so it's not a big deal.  It's | 
 | basically just the same as if you were in vcore context (though technically you | 
 | aren't) when it comes to preemption recovery: we'll just need to restart the | 
 | vcore via a syscall.  Also note that it would be a real pain in the ass to | 
 | migrate a notif_safe locking uthread.  The whole point of it is in case it grabs | 
 | a lock that would be held by vcore context, and there's no way to know it isn't | 
 | a lock on the restart-path. | 
 |  | 
 | 3.9 Why Preemption Handling Doesn't Lock Up (probably) | 
 | --------------------------------------- | 
 | One of the concerns of preemption handling is that we don't get into some form | 
 | of livelock, where we ping-pong back and forth between vcores (or a set of | 
 | vcores), all of which are trying to handle each other's preemptions.  Part of | 
 | the concern is that when a vcore sys_changes to another, it can result in | 
 | another preemption message being sent.  We want to be sure that we're making | 
 | progress, and not just livelocked doing sys_change_vcore()s. | 
 |  | 
 | A few notes first: | 
 | 1) If a vcore is holding locks or otherwise isn't handling events and is | 
 | preempted, it will let go of its locks before it gets to the point of | 
 | attempting to handle any other vcore preemption events.  Event handling is only | 
 | done when it is okay to never return (meaning no locks are held).  If this is | 
 | the situation, eventually it'll work itself out or get to a potential ping-pong | 
 | scenario. | 
 |  | 
 | 2) When you change_to while handling preemption, once you start back up, you | 
 | will leave change_to and eventually fetch a new event.  This means any | 
 | potential ping-pong needs to happen on a fresh event. | 
 |  | 
| 3) If there are enough pcores for the vcores to all run, we won't issue any | 
| change_tos, since the vcores are no longer preempted.  This means we are only | 
| worried about situations with insufficient pcores.  We'll mostly talk about 1 | 
| pcore and 2 vcores. | 
 |  | 
 | 4) Preemption handlers will not call change_to on their target vcore if they | 
 | are also the one STEALING from that vcore.  The handler will stop STEALING | 
 | first. | 
 |  | 
 | So the only way to get stuck permanently is if both cores are stuck doing a | 
 | sys_change_to(FALSE).  This means we want to become the other vcore, *and* we | 
 | need to restart our vcore where it left off.  This is due to some invariant | 
 | that keeps us from abandoning vcore context.  If we were to abandon vcore | 
 | context (with a sys_change_to(TRUE)), we basically don't need to be | 
 | preempt-recovered.  We already packaged up our cur_uthread, and we know we | 
 | aren't holding any locks or otherwise breaking any invariants.  The system will | 
 | work fine if we never run again.  (Someone just needs to check our messages). | 
 |  | 
 | Now, there are only two cases where we will do a sys_change_to(FALSE) *while* | 
 | handling preemptions.  Again, we aren't concerned about things like MCS-PDR | 
 | locks; those all work because the change_tos are done where we'd normally just | 
 | busy loop.  We are only concerned about change_tos during handle_vc_preempt. | 
 | These two cases are when the changing/handling vcore has a DONT_MIGRATE uthread | 
 | or when someone else is STEALING its uthread.  Note that both of these cases | 
 | are about the calling vcore, not its target. | 
 |  | 
 | If a vcore (referred to as "us") has a DONT_MIGRATE uthread and it is handling | 
 | events, it is because someone else is STEALING from our vcore, and we are in | 
 | the short one-shot event handling loop at the beginning of | 
 | uthread_vcore_entry().  Whichever vcore is STEALING will quickly realize it | 
 | can't steal (it sees the DONT_MIGRATE), and bail out.  If that vcore isn't | 
 | running now, we will change_to it (which is the purpose of our handling their | 
 | preemption).  Once that vcore realizes it can't steal, it will stop STEALING | 
 | and change to us.  At this point, no one is STEALING from us, and we move along | 
 | in the code.  Specifically, we do *not* handle events (we now have an event | 
 | about the other vcore being preempted when it changed_to us), and instead we | 
 | start up the DONT_MIGRATE uthread and let it run until it is migratable, at | 
 | which point we handle events and will deal with the other vcore.   | 
 |  | 
 | So DONT_MIGRATE will be sorted out.  Likewise, STEALING gets sorted out too, | 
 | quite easily.  If someone is STEALING from us, they will quickly stop STEALING | 
 | and change to us.  There are only two ways this could even happen: they are | 
 | running concurrently with us, and somehow saw us out of vcore context before | 
 | deciding to STEAL, or they were in the process of STEALING and got preempted by | 
 | the kernel.  They would not have willingly stopped running while STEALING our | 
 | cur_uthread.  So if we are running and someone is stealing, after a round of | 
 | change_tos, eventually they run, and stop STEALING. | 
 |  | 
 | Note that once someone stops STEALING from us, they will not start again, | 
 | unless we leave vcore context.  If that happened, we basically broke out of the | 
 | ping-pong, and now we're onto another set of preemptions.  We wouldn't leave | 
 | vcore context if we still had preemption events to deal with. | 
 |  | 
 | Finally, note that we needed to only check for one message at a time at the | 
 | beginning of uthread_vcore_entry().  If we just handled the entire mbox without | 
 | checking STEALING, then we might not break out of that loop if there is a | 
 | constant supply of messages (perhaps from a vcore in a similar loop). | 
 |  | 
 | Anyway, that's the basic plan behind the preemption handler and how we avoid | 
 | the ping-ponging.  change_to_vcore() is built so that we handle our own | 
 | preemption before changing (pack up our current uthread), so that we make | 
 | progress.  The two cases where we can't do that get sorted out after everyone | 
| gets to run once, and since you can't steal or have other uthreads turn on | 
 | DONT_MIGRATE while we're in vcore context, eventually we clear everything up. | 
 | There might be other bugs or weird corner cases, possibly involving multiple | 
 | vcores, but I think we're okay for now. | 
 |  | 
 | 3.10: Handling Messages for Other Vcores | 
 | --------------------------------------- | 
 | First, remember that when a vcore handles an event, there's no guarantee that | 
 | the vcore will return from the handler.  It may start fresh in vcore_entry(). | 
 |  | 
 | The issue is that when you handle another vcore's INDIRs, you may handle | 
 | preemption messages.  If you have to do a change_to, the kernel will make sure | 
 | a message goes out about your demise.  Thus someone who recovers that will | 
 | check your public mbox.  However, the recoverer won't know that you were | 
 | working on another vcore's mbox, so those messages might never be checked. | 
 |  | 
 | The way around it is to send yourself a "check the other guy's messages" event. | 
 | When we might change_to and never return, if we were dealing with another | 
| vcore's mbox, we'll send ourselves a message to finish up that mbox (if there | 
 | are any messages left).  Whoever reads our messages will eventually get that | 
 | message, and deal with it. | 
 |  | 
 | One thing that is a little ugly is that the way you deal with messages two | 
 | layers deep is to send yourself the message.  So if VC1 is handling VC2's | 
 | messages, and then wants to change_to VC3, VC1 sends a message to VC1 to check | 
 | VC2.  Later, when VC3 is checking VC1's messages, it'll handle the "check VC2's messages" | 
 | message.  VC3 can't directly handle VC2's messages, since it could run a | 
 | handler that doesn't return.  Nor can we just forget about VC2.  So VC3 sends | 
 | itself a message to check VC2 later.  Alternatively, VC3 could send itself a | 
 | message to continue checking VC1, and then move on to VC2.  Both seem | 
| equivalent.  In either case, we ought to check that the mbox actually has | 
| something in it before bothering to send the message. | 
 |  | 
 | So for either a "change_to that might not return" or for a "check INDIRs on yet | 
 | another vcore", we send messages to ourself so that we or someone else will | 
 | deal with it. | 
 |  | 
| Note that we use TLS to track whether or not we are handling another vcore's | 
| messages, and if we plan a change_to that might not return, we clear the bool | 
| so that when our vcore starts over at vcore_entry(), it starts fresh and isn't | 
| still checking someone else's messages. | 
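
A sketch of that bookkeeping.  The TLS variable and the "note to self" queue are illustrative stand-ins for the real per-vcore state and the "check the other guy's messages" event:

```c
#include <assert.h>
#include <stdbool.h>

static __thread int handling_vcoreid = -1;  /* -1: working on our own mbox */

static int self_msgs[8];	/* "check vcore X's messages" notes to self */
static int nr_self_msgs;

/* Called right before a change_to that might never return.  If we were
 * partway through another vcore's mbox, leave ourselves a note so someone
 * eventually finishes it, and clear the TLS bool so a fresh vcore_entry()
 * doesn't think it is still mid-check. */
static void prep_for_change_to(bool mbox_has_msgs)
{
	if (handling_vcoreid != -1) {
		if (mbox_has_msgs)
			self_msgs[nr_self_msgs++] = handling_vcoreid;
		handling_vcoreid = -1;
	}
}
```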
 |  | 
 | As a reminder of why this is important: these messages we are hunting down | 
 | include INDIRs, specifically ones to ev_qs such as the "syscall completed | 
 | ev_q".  If we never get that message, a uthread will block forever.  If we | 
 | accidentally yield a vcore instead of checking that message, we would end up | 
 | yielding the process forever since that uthread will eventually be the last | 
 | one, but our main thread is probably blocked on a join call.  Our process is | 
 | blocked on a message that already came, but we just missed it.  | 
 |  | 
 | 4. Single-core Process (SCP) Events: | 
 | ==================== | 
 | 4.1 Basics: | 
 | --------------------------------------- | 
 | Event delivery is important for SCP's blocking syscalls.  It can also be used | 
 | (in the future) to deliver POSIX signals, which would just be another kernel | 
 | event. | 
 |  | 
| SCPs can receive events just like MCPs.  For the most part, the code paths are | 
| the same on both sides of the K/U interface.  The kernel sends events (it can | 
| detect an SCP and will send them to vcore 0), the kernel will make sure you | 
| can't yield and miss an event, etc.  Userspace preps vcore context in advance, | 
| and can do all the things vcore context does: handle events, select a thread | 
| to run.  For an SCP, there is only one thread to run. | 
 |  | 
 | 4.2 Degenerate Event Delivery: | 
 | --------------------------------------- | 
 | That being said, there are a few tricky things.  First, there is a time before | 
 | the SCP is ready to fully receive events.  Specifically, before | 
 | vcore_event_init(), which is called out of glibc's _start.  More importantly, | 
 | the runtime linker never calls that function, yet it wants to block. | 
 |  | 
 | The important thing to note is that there are a few parts to event delivery: | 
 | registration (user), sending the event (kernel), making sure the proc wakes up | 
 | (kernel), and actually handling the event (user).  For syscalls, the only thing | 
| the process (even rtld) needs are the first three.  Registration is easy: it | 
| can be done with nothing more than kernel headers (no need for parlib) for | 
| NO_MSG ev_qs (no need to init the UCQ).  Event handling is trickier, and | 
| requires parlib | 
 | (which rtld can't link against).  To support processes that could register for | 
 | events, but not handle them (or even enter vcore context), the kernel needed a | 
 | few changes (checking the VC_SCP_NOVCCTX flag) so that it would wake the | 
 | process, but never put it in vcore context.   | 
 |  | 
 | This degenerate event handling just wakes the process up, at which point it can | 
 | check on its syscall.  Very early in the process's life, it'll init vcore0's | 
 | UCQ and be able to handle full events, enter vcore context, etc. | 
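
A sketch of the degenerate registration: with a NO_MSG-style ev_q there is no payload, hence no UCQ to init, so even rtld could fill one in from kernel-header definitions alone.  The flag values, field names, and helper here are all made up:

```c
#include <assert.h>
#include <stddef.h>

#define EVENT_IPI	0x1	/* hypothetical values */
#define EVENT_NOMSG	0x4	/* just wake me; don't deliver a message */

struct event_queue {
	void *ev_mbox;		/* unused for NOMSG: no UCQ init needed */
	int ev_flags;
	int ev_vcore;
};

/* Fill in an ev_q suitable for "wake the process when the syscall is done". */
static void init_nomsg_evq(struct event_queue *evq)
{
	evq->ev_mbox = NULL;
	evq->ev_flags = EVENT_IPI | EVENT_NOMSG;
	evq->ev_vcore = 0;	/* SCP events go to vcore 0 */
}
```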
 |  | 
 | Once the SCP is up and running, it can receive events like normal.  One thing to | 
 | note is that the SCPs are not using a handle_syscall() event handler, like the | 
 | MCPs do.  They are only using the event to get the process restarted, at which | 
 | point their vcore 0 restarts thread0.  One consequence of this is that if a | 
 | process receives an unrelated event while blocking on a syscall, it'll handle | 
 | that event, then restart thread0.  Thread0 will see its syscall isn't complete, | 
 | and then re-block.  (It also re-registers its ev_q, which is harmless).  When | 
 | that syscall is finally done, the kernel will send an event and wake it up | 
 | again. | 
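
The blocking pattern above can be sketched as a loop: wake on *any* event, re-check the syscall, and re-register/re-block if it isn't done.  The syscall struct, registration hook, and sleep hook are simplified stand-ins for the real ones:

```c
#include <assert.h>
#include <stdbool.h>

struct syscall { int done; };

static struct syscall demo_sc;
static int demo_sleeps;

static bool demo_register(struct syscall *sc)
{
	(void)sc;		/* re-registering the ev_q is harmless */
	return true;
}

static void demo_sleep(void)	/* stand-in for yielding until an event */
{
	if (++demo_sleeps == 2)
		demo_sc.done = 1;	/* "kernel" eventually completes it */
}

static void wait_on_syscall(struct syscall *sc,
                            bool (*register_evq)(struct syscall *),
                            void (*sleep_until_event)(void))
{
	while (!sc->done) {
		register_evq(sc);
		if (sc->done)	/* re-check: it may have just completed */
			break;
		sleep_until_event();	/* any event wakes us, related or not */
	}
}
```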
 |  | 
 | 4.3 Extra Tidbits: | 
 | --------------------------------------- | 
 | If we receive an event right as we transition from SCP to MCP, vcore0 could get | 
 | spammed with a message that is never received.  Right now, it's not a problem, | 
 | since vcore0 is the first vcore that will get woken up as an MCP.  This could be | 
 | an issue if we ever allow transitions from MCP back to SCP. | 
 |  | 
 | On a related note, it's now wrong for SCPs to sys_yield(FALSE) (not being nice, | 
 | meaning they are waiting for an event) in a loop that does not check events or | 
 | otherwise allow them to break out of that loop.  This should be fairly obvious. | 
 | A little more subtle is that these loops also need to sort out notif_pending. | 
 | If you are trying to yield and still have an old notif_pending set, the kernel | 
| won't let you yield (it thinks you are missing the notif).  For the degenerate | 
| mode (VC_SCP_NOVCCTX set on vcore 0), the kernel will handle dealing with this | 
| flag. | 
 |  | 
 | Finally, note that while the SCP is in vcore context, it has none of the | 
 | guarantees of an MCP.  It's somewhat meaningless to talk about being gang | 
 | scheduled or knowing about the state of other vcores.  If you're running, you're | 
| on a physical core.  You may get unexpected interrupts, be descheduled, etc.  Aside | 
 | from the guarantees and being the only vcore, the main differences are really up | 
 | to the kernel scheduler.  In that sense, we have somewhat of a new state for | 
 | processes - SCPs that can enter vcore context.  From the user's perspective, | 
 | they look a lot like an MCP, and the degenerate/early mode SCPs are like the | 
 | old, dumb SCPs.  The big difference for userspace is that there isn't a 2LS yet | 
 | (will need to reinit things slightly).  The kernel treats SCPs and MCPs very | 
 | differently too, but that may not always be the case. | 
 |  | 
 | 5. Misc Things That Aren't Sorted Completely: | 
 | ==================== | 
 | 5.1 What about short handlers? | 
 | --------------------------------------- | 
 | Once we sort the other issues, we can ask for them via a flag in the event_q, | 
 | and run the handler in the event_q struct. | 
 |  | 
 | 5.2 What about blocking on a syscall? | 
 | --------------------------------------- | 
 | The current plan is to set a flag, and let the kernel go from there.  The | 
 | kernel knows which process it is, since that info is saved in the kthread that | 
 | blocked.  One issue is that the process could muck with that flag and then go | 
| to sleep forever.  To deal with that, maybe we'd have a long-running timer to | 
 | reap those.  Arguably, it's like having a process while(1).  You can screw | 
 | yourself, etc.  Killing the process would still work. |