|  | async_events.txt | 
|  | Barret Rhoden | 
|  |  | 
|  | 1. Overview | 
|  | 2. Async Syscalls and I/O | 
|  | 3. Event Delivery / Notification | 
|  | 4. Single-core Process (SCP) Events | 
|  | 5. Misc Things That Aren't Sorted Completely: | 
|  |  | 
|  | 1. Overview | 
|  | ==================== | 
|  | 1.1 Event Handling / Notifications / Async IO Issues: | 
|  | ------------------------------------------------------------------ | 
|  | Basically, syscalls use the ROS event delivery mechanisms, redefined and | 
described below.  Syscalls use the event delivery just like any other
subsystem that wants to deliver messages to a process would.  The only other
example we have right now is "kernel notifications": the one-sided,
kernel-initiated messages that the kernel sends to a process.
|  |  | 
|  | Overall, there are several analogies from how vcores work to how the OS | 
|  | handles interrupts.  This is a result of trying to make vcores run like | 
|  | virtual multiprocessors, in control of their resources and aware of the lower | 
|  | levels of the system.  This analogy has guided much of how the vcore layer | 
works.  Whenever we have issues with the 2LSs, realize that the amount of
control they want means reaching for the same sorts of solutions the OS uses.
|  |  | 
|  | Note that there is some pointer chasing going on, though we try to keep it to | 
|  | a minimum.  Any time the kernel chases a pointer, it needs to make sure it is | 
|  | in the R/W section of userspace, though it doesn't need to check if the page | 
|  | is present.  There's more info in the Page Fault sections of the | 
|  | documentation.  (Briefly, if the kernel PFs on a user address, it will either | 
|  | block and handle the PF, or if the address was unmapped, it will kill the | 
|  | process). | 
|  |  | 
|  | 1.2 Some Definitions: | 
|  | --------------------------------------- | 
|  | ev_q, event_queue, event_q: all terms used interchangeably with each other. | 
|  | They are the endpoint for communicating messages to a process, encapsulating | 
|  | the method of delivery (such as IPI or not) with where to save the message. | 
|  |  | 
|  | Vcore context: the execution context of the virtual core on the "trampoline" | 
|  | stack.  All executions start from the top of this stack, and no stack state is | 
|  | saved between vcore_entry() calls.  All executions on here are non-blocking, | 
|  | notifications (IPIs) are disabled, and there is a specific TLS loaded.  Vcore | 
|  | context is used for running the second level scheduler (2LS), swapping between | 
threads, and handling notifications.  It is analogous to "interrupt context"
|  | in the OS.  Any functions called from here should be brief.  Any memory | 
|  | touched must be pinned.  In Lithe terms, vcore context might be called the | 
|  | Hart / hard thread.  People often wonder if they can run out of vcore context | 
|  | directly.  Technically, you can, but you lose the ability to take any fault | 
|  | (page fault) or to get IPIs for notification.  In essence, you lose control, | 
analogous to running an application in the kernel with preemption/interrupts
|  | disabled.  See the process documentation for more info. | 
|  |  | 
|  | 2LS: is the second level scheduler/framework.  This code executes in vcore | 
|  | context, and is Lithe / plugs in to Lithe (eventually).  Often used | 
|  | interchangeably with "vcore context", usually when I want to emphasize the | 
|  | scheduling nature of the code. | 
|  |  | 
|  | VCPD: "virtual core preemption data".  In procdata, there is an array of | 
|  | struct preempt_data, one per vcore.  This is the default location to look for | 
|  | all things related to the management of vcores, such as its event_mbox (queue | 
|  | of incoming messages/notifications/events).  Both the kernel and the vcore | 
|  | code know to look here for a variety of things. | 
|  |  | 
|  | Vcore-business: This is a term I use for a class of messages where the receiver | 
|  | is the actual vcore, and not just using the vcore as a place to receive the | 
|  | message.  Examples of vcore-business are INDIR events, preempt_pending events, | 
|  | scheduling events (self-ipis by the 2LS from one vcore to another), and things | 
|  | like that.  There are two types: public and private.  Private will only be | 
|  | handled by that vcore.  Public might be handled by another vcore. | 
|  |  | 
|  | Notif_table: This is a list of event_q*s that correspond to certain | 
|  | unexpected/"one-sided" events the kernel sends to the process.  It is similar | 
|  | to an IRQ table in the kernel.  Each event_q tells the kernel how the process | 
|  | wants to be told about the specific event type. | 
|  |  | 
|  | Notifications: used to be a generic event, but now used in terms of the verb | 
|  | 'notify' (do_notify()).  In older docs, passive notification is just writing a | 
|  | message somewhere.  Active notification is an IPI delivered to a vcore.  I use | 
|  | that term interchangeably with an IPI, and usually you can tell by context | 
|  | that I'm talking about an IPI going to a process (and not just the kernel). | 
|  | The details of it make it more complicated than just an IPI, but it's | 
analogous.  I've started referring to notification as the IPI, and "passive
|  | notification" as just events, though older documentation has both meanings. | 
|  |  | 
|  | BCQ: "bounded concurrent queue".  It is a fixed size array of messages | 
|  | (structs of notification events, or whatever).  It is non-blocking, supporting | 
|  | multiple producers and consumers, where the producers do not trust the | 
|  | consumers.  It is the primary mechanism for the kernel delivering message | 
|  | payloads into a process's address space.  Note that producers don't trust each | 
|  | other either (in the event of weirdness, the producers give up and say the | 
|  | buffer is full).  This means that a process can produce for one of its ev_qs | 
(which is what it needs to do to send messages to itself).
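
To make the "give up when full" semantic concrete, here is a minimal sketch of
a BCQ-style enqueue.  This is illustrative only (it is not the actual bcq.h
interface); all names are made up, and the real code has more hardening
against misbehaving producers and consumers.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_BCQ_SLOTS 64                     /* fixed size, power of two */

struct sketch_bcq_elem {
        volatile uint32_t rdy_for_cons;         /* slot holds a message */
        char payload[64];                       /* e.g. an event message */
};

struct sketch_bcq {
        volatile uint32_t prod_idx;             /* next slot a producer claims */
        volatile uint32_t cons_idx;             /* next slot a consumer drains */
        struct sketch_bcq_elem ring[SKETCH_BCQ_SLOTS];
};

/* Returns false if the queue looks full.  The producer does not trust the
 * consumer (or other producers) to make progress, so it never waits: it just
 * gives up, and the caller falls back to e.g. setting an overflow bit.
 * Assumes len <= sizeof(ring[0].payload). */
static bool sketch_bcq_enqueue(struct sketch_bcq *bcq, const void *msg,
                               size_t len)
{
        uint32_t idx;

        do {
                idx = bcq->prod_idx;
                if (idx - bcq->cons_idx >= SKETCH_BCQ_SLOTS)
                        return false;           /* full, or consumer is broken */
        } while (!__sync_bool_compare_and_swap(&bcq->prod_idx, idx, idx + 1));
        memcpy(bcq->ring[idx % SKETCH_BCQ_SLOTS].payload, msg, len);
        __sync_synchronize();                   /* payload before the ready flag */
        bcq->ring[idx % SKETCH_BCQ_SLOTS].rdy_for_cons = 1;
        return true;
}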
|  |  | 
|  | UCQ: "unbounded concurrent queue".  This is a data structure allowing the kernel | 
|  | to produce an unbounded number of messages for the process to consume.  The main | 
|  | limitation to the number of messages is RAM.  Check out its documentation. | 
|  |  | 
|  | 2. Async Syscalls and I/O | 
|  | ==================== | 
|  | 2.1 Basics | 
|  | ---------------------------------------------- | 
|  | The syscall struct is the contract for work with the kernel, including async | 
|  | I/O.  Lots of current OS async packages use epoll or other polling systems. | 
|  | Note the distinction between Polling and Async I/O.  Polling is about finding | 
|  | out if a call will block.  It is primarily used for sockets and pipes.  It | 
|  | does relatively nothing for disk I/O, which requires a separate async I/O | 
|  | system.  By having all syscalls be async, we can make polling a bit easier and | 
|  | more unified with the generic event code that we use for all syscalls. | 
|  |  | 
|  | For instance, we can have a sys_poll syscall, which is async just like any | 
other syscall.  The call can be a "one shot / non-blocking", like current
systems' polling code, or it can also notify on change (not requiring future
|  | polls) via the event_q mechanisms.  If you don't want to be IPId, you can | 
|  | "poll" the syscall struct - not requiring another kernel crossing/syscall. | 
|  |  | 
|  | Note that we do not tie syscalls and polling to FDs.  We do events on | 
|  | syscalls, which can be used to check FDs.  I think a bunch of polling cases | 
|  | will not be needed once we have async syscalls, but for those that remain, | 
|  | we'll have sys_poll() (or whatever). | 
|  |  | 
|  | To receive an event on a syscall completion or status change, just fill in the | 
|  | event_q pointer.  If it is 0, the kernel will assume you poll the actual | 
|  | syscall struct. | 
|  |  | 
struct syscall {
        current stuff           /* arguments, retvals */
        struct ev_queue *       /* struct used for messaging, including IPIs */
        void *                  /* used by 2LS, usually a struct u_thread * */
}
|  |  | 
|  | One issue with async syscalls is that there can be too many outstanding IOs | 
|  | (normally sync calls provide feedback / don't allow you to over-request). | 
|  | Eventually, processes can exhaust kernel memory (the kthreads, specifically). | 
|  | We need a way to limit the kthreads per proc, etc.  Shouldn't be a big deal. | 
|  |  | 
|  | Normally, we talk about changing the flag in a syscall to SC_DONE.  Async | 
|  | syscalls can be SC_PROGRESS (new stuff happened on it), which can trigger a | 
|  | notification event.  Some calls, like AIO or bulk accept, exist for a while | 
|  | and slowly get filled in / completed.  In the future, we'll also want a way to | 
|  | abort the in-progress syscalls (possibly any syscall!). | 
|  |  | 
|  | 2.2 Uthreads Blocking on Syscalls | 
|  | ---------------------------------------------- | 
|  | Many threading libraries will want some notion of a synchronous, blocking | 
|  | thread.  These threads use regular I/O calls, which are async under the hood, | 
but don't want to bother with callbacks or other details of async I/O.  In
this section, I'll talk a bit about how this works, especially regarding
|  | uthreads/pthreads. | 
|  |  | 
|  | 'Blocking' refers to user threads, and has nothing to do with an actual | 
|  | process blocking/waiting on some kernel event.  The kernel does not know | 
|  | anything about what goes on here.  While a bit confusing, this allows | 
|  | applications to do whatever they want on top of an async interface, and is a | 
|  | consequence of decoupling cores from user-threads from kthreads. | 
|  |  | 
|  | 2.2.1 Basics of Uthread Blocking | 
|  | --------------- | 
|  | When a thread calls a glibc function that makes a system call, if the syscall | 
|  | is not yet complete when the kernel returns to userspace, glibc will check for | 
|  | the existence of a second level scheduler and attempt to use it to yield its | 
|  | uthread.  If there is no 2LS, the code just spins for now.  Eventually, it | 
|  | will try to suspend/yield the process for a while (til the call is done), aka, | 
|  | block in the kernel. | 
|  |  | 
|  | If there is a 2LS, the current thread will yield, and call out to the 2LS's | 
|  | blockon_sysc() method, which needs a way to stop the thread and be able to | 
|  | restart it when the syscall completes.  Specifically, the pthread 2LS registers | 
|  | the syscall to respond to an event (described in detail elsewhere in this doc). | 
|  | When the event comes in, meaning the syscall is complete, the thread is put on | 
|  | the runnable list. | 
|  |  | 
|  | Details: | 
|  | - A pointer to the struct pthread is stored in the syscall's void*.  When the | 
|  | syscall is done, we normally get a message from the kernel, and the payload | 
|  | tells us the syscall is done, which tells us which thread to unblock. | 
|  | - The pthread code also always asks for an IPI and event message for every | 
|  | syscall that completes.  This is far from ideal.  Still, the basics are the | 
|  | same for any threading library.  Once you know a thread is done, you need to | 
|  | do something about it. | 
|  | - The pthread code does syscall blocking and event notification on a per-core | 
|  | basis.  Using the default (VCPD) ev_mbox for this is a bad idea (which we did | 
|  | at some point). | 
|  | - There's a race between the 2LS trying to sign up for events and the kernel | 
|  | finishing the event.  We handle this in uthread code, so use the helper to | 
register_evq(), which does the right thing (atomics, careful ordering
|  | with writes, etc). | 
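
Here's a rough sketch of that flow.  It is not the actual pthread 2LS code;
the field names (sysc->u_data for the syscall's void*, ev_msg->ev_arg3 for the
payload) and the per-vcore sysc_ev_q[] array are illustrative.

/* Called (indirectly) by glibc when a syscall didn't complete immediately. */
static void pth_blockon_sysc(struct uthread *uth, struct syscall *sysc)
{
        struct pthread *pth = (struct pthread *)uth;

        pth->state = PTH_BLK_SYSC;
        sysc->u_data = pth;     /* the syscall's void*: which thread to wake */
        /* register_evq() (the uthread helper mentioned above) deals with the
         * race against the kernel finishing the syscall first */
        if (!register_evq(sysc, sysc_ev_q[vcore_id()]))
                pth_thread_runnable(uth);       /* already done, just requeue */
}

/* Run in vcore context when the "syscall completed" event arrives. */
static void pth_handle_sysc_done(struct event_msg *ev_msg)
{
        struct syscall *sysc = ev_msg->ev_arg3; /* payload: struct syscall* */
        struct pthread *pth = sysc->u_data;

        pth_thread_runnable((struct uthread *)pth);
}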
|  |  | 
2.2.2 Recovering from Event Overflow
|  | --------------- | 
Event overflow recovery is unnecessary, since syscall ev_qs use UCQs now.  This
|  | section is kept around for some useful tidbits, such as details about | 
|  | deregistering ev_qs for a syscall: | 
|  |  | 
|  | --------------------------- | 
|  | The pthread code expects to receive an event somehow to unblock a thread | 
|  | once its syscall is done.  One limitation to our messaging systems is that you | 
|  | can't send an infinite amount of event messages.  (By messages, I mean a chunk | 
|  | of memory with a payload, in this case consisting of a struct syscall *). | 
|  | Event delivery degrades to a bit in the case of the message queue being full | 
|  | (more details on that later). | 
|  |  | 
The pthread code (and any similar 2LS) needs to handle waking up syscalls when
the event message was lost and all we know is that some syscall, which was
meant to have a message sent to a particular event queue (per-core in the case
of the pthread code, actually the VCPD for now), has completed.  The basic
idea is to poll all outstanding system calls and unblock whoever is done.
|  |  | 
|  | The key problem is due to a race: for a given syscall we don't know if we're | 
|  | going to get a message for a syscall or not.  There could be a completion | 
|  | message in the queue for the syscall while we are going through the list of | 
|  | blocked threads.  If we assume we already got the message (or it was lost in | 
the overflow), but didn't really, then if we finish an SC and free its memory
|  | (free or return up the stack), we could later get a message for it, and all | 
|  | sorts of things would go wrong (like trying to unblock a pointer that is | 
|  | gibberish). | 
|  |  | 
|  | Here's what we do: | 
|  | 1) Set a "handling overflow" flag so we don't recurse. | 
|  | 2) Turn off event delivery for all syscalls on our list | 
|  | 3) Handle any event messages.  This is how we make a distinction between | 
|  | finished syscalls that had a message sent and those that didn't.  We're doing | 
|  | the message-sent ones here. | 
|  | 4) For any left on the list, check to see if they are done.  We actually do | 
|  | this by attempting to turn on event delivery for them.  Turning on event | 
|  | delivery can fail if the call is already done.  So if it fails, they are done | 
|  | and we unblock them (similar to how we block the threads in the first place). | 
|  | If it doesn't fail, they are now ready to receive messages.  This can be | 
|  | tweaked a bit. | 
|  | 5) Unset the overflow-handling flag. | 
|  |  | 
One thing to be careful of: when you turn off event delivery, you need to
|  | be sure the kernel isn't in the process of sending an event.  This is why we | 
|  | have the SC_K_LOCK syscall flag.  Uthread code will not consider deregistration | 
|  | complete while that flag is set, since the kernel is still mucking with the | 
|  | syscall (and sending an event).  Once the flag is clear, the event has been | 
|  | delivered (the ev_msg is in the ev_mbox), and our assumptions remain true. | 
|  |  | 
|  | There are a couple implications of this style.  If you have a shared event | 
|  | queue (with other event sources), those events can get mixed in with the | 
|  | recovery.  Don't leave the vcore context due to other events.  This'll | 
|  | probably need work.  The other thing is that completed syscalls can get | 
|  | handled in a different order than they were signaled.  Shouldn't be a big | 
|  | deal. | 
|  |  | 
|  | Note on the overflow handling flag and unsetting it.  There should not be any | 
|  | races with this.  The flag prevented us from handling overflows on the event | 
queue.  Other than when we checked for events that had been successfully sent,
|  | we didn't try to handle events.  We can unset the flag, and at that point we | 
|  | can start handling missed events.  If there was an overflow after we last | 
|  | checked the list, but before we cleared the overflow-handling flag, we'll | 
|  | still catch it since we haven't tried handling events in between checking the | 
|  | list and clearing the flag.  That flag doesn't even matter until we want to | 
handle_events, so we aren't missing anything.  The next handle_events() will
|  | deal with everything from scratch. | 
|  |  | 
|  | For blocking threads that block concurrently with the overflow handling: in | 
|  | the pthread case, this can't happen since everything is per-vcore.  If you do | 
|  | have process-wide thread blocking/syscall management, we can add new ones, but | 
|  | they must have event delivery turned off when they are added to the list.  And | 
|  | you'll need to lock the list, etc.  This should work in part due to new | 
|  | syscalls being added to the end of the list, and the overflow-handler | 
|  | proceeding linearly through the list. | 
|  |  | 
|  | Also note that we shouldn't handle the event for unblocking a syscall on a | 
|  | different core than the one it was submitted to.  This could result in | 
|  | concurrent modifications to the original core's TAILQ (bad).  This restriction | 
|  | is dependent on how a 2LS does its thread handling/blocking. | 
|  |  | 
|  | Eventually, we'll want a way to detect and handle excessive overflow, since | 
|  | it's probably quite expensive.  Perhaps turn it off and periodically poll the | 
|  | syscalls for completion (but don't bother turning on the ev_q). | 
|  | --------------------------- | 
|  |  | 
|  | 3. Event Delivery / Notification | 
|  | ==================== | 
|  | 3.1 Basics | 
|  | ---------------------------------------------- | 
|  | The mbox (mailbox) is where the actual messages go. | 
|  |  | 
struct ev_mbox {
        bcq of notif_events     /* bounded buffer, multi-consumer/producer */
        msg_bitmap
}
struct ev_queue {               /* aka, event_q, ev_q, etc. */
        struct ev_mbox *
        void handler(struct event_q *)
        vcore_to_be_told
        flags                   /* IPI_WANTED, RR, 2L-handle-it, etc */
}
struct ev_queue_big {
        struct ev_mbox *        /* pointing to the internal storage */
        vcore_to_be_told
        flags                   /* IPI_WANTED, RR, 2L-handle-it, etc */
        struct ev_mbox { }      /* never access this directly */
}
|  |  | 
|  | The purpose of the big one is to simply embed some storage.  Still, only | 
access the mbox via the pointer.  The big one can be cast to (and stored as)
|  | the regular, so long as you know to dealloc a big one (free() knows, custom | 
|  | styles or slabs would need some help). | 
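
A sketch of what that looks like (field names like ev_mbox/ev_imbox are
placeholders for the members sketched above):

#include <stdlib.h>     /* calloc */

static struct ev_queue *alloc_big_ev_q(void)
{
        struct ev_queue_big *big_q = calloc(1, sizeof(struct ev_queue_big));

        /* point the mbox pointer at the embedded storage; from here on, all
         * access goes through the pointer, never the embedded struct */
        big_q->ev_mbox = &big_q->ev_imbox;
        /* callers treat it as a plain ev_queue; free() still does the right
         * thing, since the big struct was the actual allocation */
        return (struct ev_queue *)big_q;
}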
|  |  | 
|  | The ev_mbox says where to put the actual message, and the flags handle things | 
|  | such as whether or not an IPI is wanted. | 
|  |  | 
|  | Using pointers for the ev_q like this allows multiple event queues to use the | 
|  | same mbox.  For example, we could use the vcpd queue for both kernel-generated | 
|  | events as well as async syscall responses.  The notification table is actually | 
|  | a bunch of ev_qs, many of which could be pointing to the same vcore/vcpd-mbox, | 
|  | albeit with different flags. | 
|  |  | 
|  | 3.2 Kernel Notification Using Event Queues | 
|  | ---------------------------------------------- | 
|  | The notif_tbl/notif_methods (kernel-generated 'one-sided' events) is just an | 
|  | array of struct ev_queue*s.  Handling a notification is like any other time | 
|  | when we want to send an event.  Follow a pointer, send a message, etc.  As | 
|  | with all ev_qs, ev_mbox* points to where you want the message for the event, | 
|  | which usually is the vcpd's mbox.  If the ev_q pointer is 0, then we know the | 
|  | process doesn't want the event (equivalent to the older 'NOTIF_WANTED' flag). | 
|  | Theoretically, we can send kernel notifs to user threads.  While it isn't | 
|  | clear that anyone will ever want this, it is possible (barring other issues), | 
|  | since they are just events. | 
|  |  | 
|  | Also note the flag EVENT_VCORE_APPRO.  Processes should set this for certain | 
|  | types of events where they want the kernel to send the event/IPI to the | 
|  | 'appropriate' vcore.  For example, when sending a message about a preemption | 
|  | coming in, it makes sense for the kernel to send it to the vcore that is going | 
|  | to get preempted, but the application could choose to ignore the notification. | 
|  | When this flag is set, the kernel will also use the vcore's ev_mbox, ignoring | 
|  | the process's choice.  We can change this later, but it doesn't really make | 
|  | sense for a process to pick an mbox and also say VCORE_APPRO. | 
|  |  | 
|  | There are also interfaces in the kernel to put a message in an ev_mbox | 
regardless of the process's wishes (post_vcore_event()), and to send an IPI
|  | at any time (proc_notify()). | 
|  |  | 
|  | 3.3 IPIs, Indirection Events, and Fallback (Spamming Indirs) | 
|  | ---------------------------------------------- | 
|  | An ev_q can ask for an IPI, for an indirection event, and for an indirection | 
|  | event to be spammed in case a vcore is offline (sometimes called the 'fallback' | 
option), or any combination of these.  Note that these have little to do with
|  | the actual message being sent.  The actual message is dropped in the ev_mbox | 
|  | pointed to by the ev_q. | 
|  |  | 
|  | The main use for all of this is for syscalls.  If you want to receive an event | 
|  | when a syscall completes or has a change in status, simply allocate an event_q, | 
|  | and point the syscall at it.  syscall: ev_q* -> "vcore for IPI, syscall message | 
|  | in the ev_q mbox", etc.  You can also point it to an existing ev_q.  Pthread | 
code has examples of two ways to do this.  Both have per-vcore ev_qs, requesting
|  | IPIs, INDIRS, and SPAM_INDIR.  One way is to have an ev_mbox per vcore, and | 
|  | another is to have a global ev_mbox that all ev_qs point to.  As a side note, if | 
|  | you do the latter, you don't need to worry about a vcore's ev_q if it gets | 
|  | preempted: just check the global ev_mbox (which is done by checking your own | 
|  | vcore's syscall ev_q). | 
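
Here's a sketch of that second variant: one global mbox that every per-vcore
ev_q points to.  The flag names (EVENT_IPI, EVENT_INDIR, EVENT_SPAM_INDIR)
are the ones used in this document; the field names and init helpers are
illustrative.

static struct ev_mbox sysc_mbox;                /* one shared mbox */
static struct ev_queue *sysc_ev_q[MAX_VCORES];  /* one ev_q per vcore */

static void init_sysc_ev_qs(void)
{
        init_ev_mbox(&sysc_mbox);               /* e.g. set up the UCQ */
        for (int i = 0; i < MAX_VCORES; i++) {
                sysc_ev_q[i] = alloc_ev_q();
                sysc_ev_q[i]->ev_mbox = &sysc_mbox;     /* all share one mbox */
                sysc_ev_q[i]->ev_vcore = i;             /* who gets IPI/INDIR */
                sysc_ev_q[i]->ev_flags = EVENT_IPI | EVENT_INDIR |
                                         EVENT_SPAM_INDIR;
        }
}

When blocking a uthread, point the syscall at sysc_ev_q[vcore_id()].  Since
all the ev_qs share one mbox, a vcore handling a preempted peer only has to
look in one place for the actual syscall messages.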
|  |  | 
|  | 3.3.1: IPIs and INDIRs | 
|  | --------------- | 
|  | An EVENT_IPI simply means we'll send an IPI to the given vcore.  Nothing else. | 
|  | This will usually be paired with an Indirection event (EVENT_INDIR).  An INDIR | 
|  | is a message of type EV_EVENT with an ev_q* payload.  It means "check this | 
|  | ev_q".  Most ev_qs that ask for an IPI will also want an INDIR so that the vcore | 
|  | knows why it was IPIed.  You don't have to do this: for instance, your 2LS might | 
|  | poll its own ev_q, so you won't need the indirection event. | 
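
Conceptually, the INDIR handler is tiny (sketch; the message field and drain
helper names are illustrative):

/* Handler for EV_EVENT messages: the payload names the ev_q to go look at. */
static void handle_ev_event(struct event_msg *ev_msg)
{
        struct ev_queue *evq = ev_msg->ev_arg3; /* "check this ev_q" */

        /* the queue may turn out to be empty; that's fine, since INDIRs
         * (like IPIs) can be spurious */
        handle_mbox(evq->ev_mbox);
}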
|  |  | 
|  | Additionally, note that IPIs and INDIRs can be spurious.  It's not a big deal to | 
receive an IPI and have nothing to do, or to be told to check an empty ev_q.
|  | All of the event handling code can deal with this. | 
|  |  | 
|  | INDIR events are sent to the VCPD public mbox, which means they will get handled | 
|  | if the vcore gets preempted.  Any other messages sent here will also get handled | 
|  | during a preemption.  However, the only type of messages you should use this for | 
are ones that can handle spurious messages.  The completion of a syscall is an
example of a message that cannot be spurious, so it should not be sent here.
Since INDIRs can be spurious, they can use the public mbox.  (Side note: the
kernel may spam INDIRs in attempting
|  | to make sure you get the message on a vcore that didn't yield.) | 
|  |  | 
|  | Never use a VCPD mbox (public or private) for messages you might want to receive | 
|  | if that vcore is offline.  If you want to be sure to get a message, create your | 
|  | own ev_q and set flags for INDIR, SPAM_INDIR, and IPI.  There's no guarantee a | 
|  | *specific* message will get looked at.  In cases where it won't, the kernel will | 
|  | send that message to another vcore.  For example, if the kernel posts an INDIR | 
|  | to a VCPD mbox (the public one btw) and it loses a race with the vcore yielding, | 
|  | the vcore might never see that message.  However, the kernel knows it lost the | 
|  | race, and will find another vcore to send it to. | 
|  |  | 
|  | 3.3.2: Spamming Indirs / Fallback | 
|  | --------------- | 
|  | Both IPI and INDIR need an actual vcore.  If that vcore is unavailable and if | 
|  | EVENT_SPAM_INDIR is set, the kernel will pick another vcore and send the | 
|  | messages there.  This allows an ev_q to be set up to handle work when the vcore | 
|  | is online, while allowing the program to handle events when that core yields, | 
|  | without having to reset all of its ev_qs to point to "known" available vcores | 
|  | (and avoiding those races).  Note 'online' is synonymous with 'mapped', when | 
talking about vcores.  A vcore that is mapped to a pcore technically isn't
always online yet, only destined to be online (kmsg on the way, etc).  It's
|  | easiest to think of it being online for the sake of this discussion. | 
|  |  | 
|  | One question is whether or not 2LSs need a SPAM_INDIR flag for their ev_qs. | 
|  | The main use for SPAM_INDIR is so that vcores can yield.  (Note that fallback | 
won't keep you from *missing* INDIR messages in the event of a preemption; you can
|  | always lose that race due to it taking too long to process the messages).  An | 
|  | alternative would be for vcores to pick another vcore and change all of its | 
|  | ev_qs to that vcore.  There are a couple problems with this.  One is that it'll | 
|  | be a pain to get those ev_qs back when the vcore comes back online (if ever). | 
|  | Another issue is that other vcores will build up a list of ev_qs that they | 
|  | aren't aware of, which will be hard to deal with when *they* yield.  SPAM_INDIR | 
|  | avoids all of those problems. | 
|  |  | 
|  | An important aspect of spamming indirs is that it works with yielded vcores, | 
|  | not preempted vcores.  It could be that there are no cores that are online, but | 
|  | there should always be at least one core that *will* be online in the future, a | 
|  | core that the process didn't want to lose and will deal with in the future.  If | 
|  | not for this distinction, SPAM_INDIR could fail.  An older idea would be to have | 
|  | fallback send the msg to the desired vcore if there were no others.  This would | 
|  | not work if the vcore yielded and then the entire process was preempted or | 
|  | otherwise not running.  Another way to put this is that we need a field to | 
|  | determine whether a vcore is offline temporarily or permanently. | 
|  |  | 
|  | This is why we have the VCPD flag 'VC_CAN_RCV_MSG'.  It tells the kernel's event | 
|  | delivery code that the vcore will check the messages: it is an acceptable | 
|  | destination for a spammed indir.  There are two reasons to put this in VCPD: | 
|  | 1) Userspace can remotely turn off a vcore's msg reception.  This is necessary | 
|  | for handling preemption of a vcore that was in uthread context, so that we can | 
|  | remotely 'yield' the core without having to sys_change_vcore() (which I discuss | 
|  | below, and is meant to 'unstick' a vcore). | 
|  | 2) Yield is simplified.  The kernel no longer races with itself nor has to worry | 
|  | about turning off that flag - userspace can do it when it wants to yield.  (turn | 
off the flag, check messages, then yield).  This is less of a big deal now that
|  | the kernel races with vcore membership in the online_vcs list. | 
|  |  | 
|  | Two aspects of the code make this work nicely.  The VC_CAN_RCV_MSG flag greatly | 
|  | simplifies the kernel's job.  There are a lot of weird races we'd have to deal | 
|  | with, such as process state (RUNNING_M), whether a mass preempt is going on, or | 
|  | just one core, or a bunch of cores, mass yields, etc.  A flag that does one | 
thing well helps a lot - especially since preemption is not the same as yielding.  The
|  | other useful thing is being able to handle spurious events.  Vcore code can | 
|  | handle extra IPIs and INDIRs to non-VCPD ev_qs.  Any vcore can handle an ev_q | 
|  | that is "non-VCPD business". | 
|  |  | 
|  | Worth mentioning is the difference between 'notif_pending' and VC_CAN_RCV_MSG. | 
|  | VC_CAN_RCV_MSG is the process saying it will check for messages. | 
|  | 'notif_pending' is when the kernel says it *has* sent a message. | 
|  | 'notif_pending' is also used by the kernel in proc_yield() and the 2LS in | 
|  | pop_user_ctx() to make sure the sent message is not missed. | 
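
To make the ordering concrete, here is a sketch of the userspace side of a
yield, per the "turn off the flag, check messages, then yield" rule above.
The vcpd field names and the barrier/helper names are illustrative.

static void vcore_try_yield(uint32_t vcoreid)
{
        struct preempt_data *vcpd = vcpd_of(vcoreid);

        /* 1) stop advertising ourselves as a target for spammed INDIRs */
        vcpd->flags &= ~VC_CAN_RCV_MSG;
        wrmb();         /* make the flag visible before checking messages */
        /* 2) drain anything that raced in before the kernel saw the flag */
        handle_events(vcoreid);
        /* 3) yield; if notif_pending says a message arrived that we haven't
         * seen, the kernel will refuse the yield and restart the vcore */
        sys_yield(FALSE);
}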
|  |  | 
|  | Also, in case this comes up, there's a slight race on changing the mbox* and the | 
|  | vcore number within the event_q.  The message could have gone to the wrong (old) | 
|  | vcore, but not the IPI.  Not a big deal - IPIs can be spurious, and the other | 
vcore will eventually get it.  The real way around this is to create a new ev_q and
|  | change the pointer (thus atomically changing the entire ev_q's contents), though | 
|  | this can be a bit tricky if you have multiple places pointing to the same ev_q | 
|  | (can't change them all at once). | 
|  |  | 
|  | 3.3.3: Fallback and Preemption | 
|  | --------------- | 
|  | SPAM_INDIR doesn't protect you from preemptions.  A vcore can be preempted and | 
|  | have INDIRs in its VCPD. | 
|  |  | 
|  | It is tempting to just use sys_change_vcore(), which will change the calling | 
|  | vcore to the new one.  This should only be used to "unstick" a vcore.  A vcore | 
|  | is stuck when it was preempted while it had notifications disabled.  This is | 
usually when it is in vcore context, but also in any lock-holding code for locks
|  | shared with vcore context (the userspace equivalent of irqsave locks).  With | 
|  | this syscall, you could change to the offline vcore and process its INDIRs. | 
|  |  | 
|  | The problem with that plan is the calling core (that is trying to save the | 
|  | other) may have extra messages, and that sys_change_vcore does not return.  We | 
|  | need a way to deal with our other messages.  We're back to the same problem we | 
|  | had before, just with different vcores.  The only thing we really accomplished | 
|  | is that we unstuck the other vcore.  We could tell the restarted vcore (via an | 
|  | event) to switch back to us, but by the time it does that, it may have other | 
|  | events that got lost.  So we're back to polling the ev_qs that it might have | 
|  | received INDIRs about.  Note that we still want to send an event with | 
|  | sys_change_vcore().  We want the new vcore to know the old vcore was put | 
|  | offline: a preemption (albeit one that it chose to do, and one that isn't stuck | 
|  | in vcore context). | 
|  |  | 
One older approach was to force the 2LS to deal with this.  The 2LS
|  | would check the ev_mboxes/ev_qs of all ev_qs that could send INDIRS to the | 
|  | offline vcore.  There could be INDIRS in the VCPD that are just lying there. | 
|  | The 2LS knows which ev_qs these are (such as for completed syscalls), and for | 
|  | many things, this will be a common ev_q (such as for 'vcore-x-was-preempted'). | 
|  | However, this is a huge pain in the ass, since a preempted vcore could have the | 
|  | spammed INDIR for an ev_q associated with another vcore.  To deal with this, | 
|  | the 2LS would need to check *every* ev_q that requests INDIRs.  We don't do | 
|  | this. | 
|  |  | 
|  | Instead, we simply have the remote core check the VCPD public mbox of the | 
|  | preempted vcore.  INDIRs (and other vcore business that other vcores can handle) | 
|  | will get sorted here. | 
|  |  | 
|  | 3.3.5: Lists to Find Vcores | 
|  | --------------- | 
|  | A process has three lists: online, bulk_preempt, and inactive.  These not only | 
|  | are good for process management, but also for helping alert_vcore() find | 
|  | potentially alertable vcores.  alert_vcore() and its associated helpers are | 
fairly complicated and heavily commented.  I've set things up so both the
|  | online_vcs and the bulk_preempted_vcs lists can be handled the same way: post to | 
the first element, then see if it still has VC_CAN_RCV_MSG set.  If not, if it is still
|  | the first on the list, then it hasn't proc_yield()ed yet, and it will eventually | 
|  | restart when it tries to yield.  And this all works without locking the | 
|  | proc_lock.  There are a bunch more details and races avoided.  Check the code | 
|  | out. | 
|  |  | 
3.3.6: Vcore Business and the VCPD Mboxes
|  | --------------- | 
|  | There are two types of VCPD mboxes: public and private.  Public ones will get | 
|  | handled during preemption recovery.  Messages sent here need to be handle-able | 
|  | by any vcore.  Private messages are for that specific vcore.  In the common | 
|  | case, the public mbox will usually only get looked at by its vcore.  Only during | 
|  | recovery and some corner cases will we deal with it remotely. | 
|  |  | 
Here are some guidelines: if your message is spammy and the handler can deal with
|  | spurious events and it doesn't need to be on a specific vcore, then go with | 
|  | public.  Examples of public mbox events are ones that need to be spammed: | 
|  | preemption recovery, INDIRs, etc.  Note that you won't need to worry about | 
|  | these: uthread code and the kernel handle them.  But if you have something | 
|  | similar, then that's where it would go.  You can also send non-spammy things, | 
|  | but there's no guarantee they'll be looked at. | 
|  |  | 
|  | Some messages should only be sent to the private mbox.  These include ones that | 
|  | make no sense for other vcores to handle.  Examples: 2LS IPIs/preemptions (like | 
|  | "change your scheduling policy vcore 3", preemption-pending notifs from the | 
|  | kernel, timer interrupts, etc. | 
|  |  | 
|  | An example of something that shouldn't be sent to either is syscall completions. | 
|  | They can't be spammed, so you can't send them around like INDIRs.  And they need | 
|  | to be dealt with.  Other than carefully-spammed public messages, there's no | 
|  | guarantee of getting a message for certain scenarios (yields).  Instead, use an | 
|  | ev_q with INDIR set. | 
|  |  | 
|  | Also note that a 2LS could set up a big ev_q with EVENT_IPI and not EVENT_INDIR, | 
|  | and then poll for that in their vcore_entry().  This is equivalent to setting up | 
|  | a small ev_q with EVENT_IPI and pointing it at the private mbox. | 
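
Sketched with the fields from section 3.1 (the private-mbox field name and the
helpers are illustrative):

/* (a) big ev_q with its own storage; the 2LS polls it in vcore_entry() */
static struct ev_queue *setup_polled_evq(void)
{
        struct ev_queue *evq = alloc_big_ev_q();        /* sketch from 3.1 */

        evq->ev_flags = EVENT_IPI;      /* no INDIR: we'll poll it ourselves */
        evq->ev_vcore = vcore_id();
        return evq;
}

/* (b) small ev_q aimed at this vcore's private VCPD mbox - equivalent */
static struct ev_queue *setup_private_evq(void)
{
        struct ev_queue *evq = alloc_ev_q();

        evq->ev_mbox = &vcpd_of(vcore_id())->ev_mbox_private;
        evq->ev_flags = EVENT_IPI;
        evq->ev_vcore = vcore_id();
        return evq;
}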
|  |  | 
|  | 3.4 Application-specific Event Handling | 
|  | --------------------------------------- | 
|  | So what happens when the vcore/2LS isn't handling an event queue, but has been | 
|  | "told" about it?  This "telling" is in the form of an IPI.  The vcore was | 
|  | prodded, but is not supposed to handle the event.  This is actually what | 
|  | happens now in Linux when you send signals for AIO.  It's all about who (which | 
|  | thread, in their world) is being interrupted to process the work in an | 
|  | application specific way.  The app sets the handler, with the option to have a | 
|  | thread spawned (instead of a sighandler), etc. | 
|  |  | 
|  | This is not exactly the same as the case above where the ev_mbox* pointed to | 
|  | the vcore's default mbox.  That issue was just about avoiding extra messages | 
|  | (and messages in weird orders).  A vcore won't handle an ev_q if the | 
|  | message/contents of the queue aren't meant for the vcore/2LS.  For example, a | 
|  | thread can want to run its own handler, perhaps because it performs its own | 
|  | asynchronous I/O (compared to relying on the 2LS to schedule synchronous | 
|  | blocking u_threads). | 
|  |  | 
|  | There are a couple ways to handle this.  Ultimately, the application is supposed | 
|  | to handle the event.  If it asked for an IPI, it is because something ought to | 
|  | be done, which really means running a handler.  We used to support the | 
|  | application setting EVENT_THREAD in the ev_q's flags, and the 2LS would spawn a | 
|  | thread to run the ev_q's handler.  Now we just have the application block a | 
|  | uthread on the evq.  If an ev_handler is set, the vcore will execute the | 
|  | handler itself.  Careful with this, since the only memory it touches must be | 
|  | pinned, the function must not block (this is only true for the handlers called | 
|  | directly out of vcore context), and it should return quickly. | 
|  |  | 
|  | Note that in either case, vcore-written code (library code) does not look at | 
|  | the contents of the notification event.  Also note the handler takes the whole | 
|  | event_queue, and not a specific message.  It is more flexible, can handle | 
|  | multiple specific events, and doesn't require the vcore code to dequeue the | 
|  | event and either pass by value or allocate more memory. | 
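
A sketch of such a handler, matching the handler field sketched in section 3.1
(it gets the whole event_queue; the dequeue helper name is illustrative).
Remember the constraints above: pinned memory only, no blocking, return
quickly.

static void my_aio_evq_handler(struct ev_queue *evq)
{
        struct event_msg msg;

        /* drain however many messages are there - possibly none, since
         * spurious IPIs/INDIRs are allowed */
        while (extract_one_msg(evq->ev_mbox, &msg))
                mark_aio_complete(msg.ev_arg3); /* app-specific, non-blocking */
}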
|  |  | 
|  | These ev_q handlers are different than ev_handlers.  The former handles an | 
|  | event_queue.  The latter is the 2LS's way to handle specific types of messages. | 
|  | If an app wants to process specific messages, have them sent to an ev_q under | 
|  | its control; don't mess with ev_handlers unless you're the 2LS (or example | 
|  | code). | 
|  |  | 
|  | Continuing the analogy between vcores getting IPIs and the OS getting HW | 
|  | interrupts, what goes on in vcore context is like what goes on in interrupt | 
|  | context, and the threaded handler is like running a threaded interrupt handler | 
|  | (in Linux).  In the ROS world, it is like having the interrupt handler kick | 
|  | off a kernel message to defer the work out of interrupt context. | 
|  |  | 
|  | If neither of the application-specific handling flags are set, the vcore will | 
|  | respond to the IPI by attempting to handle the event on its own (lookup table | 
|  | based on the type of event (like "syscall complete")).  If you didn't want the | 
|  | vcore to handle it, then you shouldn't have asked for an IPI.  Those flags are | 
|  | the means by which the vcore can distinguish between its event_qs and the | 
|  | applications.  It does not make sense otherwise to send the vcore an IPI and | 
an event_q, but not give the code the info it needs to handle it.
|  |  | 
|  | In the future, we might have the ability to block a u_thread on an event_q, so | 
|  | we'll have other EV_ flags to express this, and probably a void*.  This may | 
end up being redundant, since u_threads will be able to block on syscalls (and
|  | not necessarily IPIs sent to vcores). | 
|  |  | 
|  | As a side note, a vcore can turn off the IPI wanted flag at any time.  For | 
|  | instance, when it spawns a thread to handle an ev_q, the vcore can turn off | 
|  | IPI wanted on that event_q, and the thread handler can turn it back on when it | 
|  | is done processing and wants to be re-IPId.  The reason for this is to avoid | 
|  | taking future IPIs (once we leave vcore context, IPIs are enabled) to let us | 
|  | know about an event for which a handler is already running. | 
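
For example (flag name per section 3.3.1; the thread plumbing is illustrative):

/* vcore context: got IPId about this ev_q, hand it off to an app thread */
static void vcore_hand_off_evq(struct ev_queue *evq)
{
        evq->ev_flags &= ~EVENT_IPI;    /* don't re-IPI while it's worked on */
        spawn_handler_thread(evq);      /* 2LS/app specific */
}

/* in the handler thread, after draining the queue */
static void handler_thread_done(struct ev_queue *evq)
{
        evq->ev_flags |= EVENT_IPI;     /* okay to be re-IPId for new events */
}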
|  |  | 
|  | 3.5 Overflowed/Missed Messages in the VCPD | 
|  | --------------------------------------- | 
|  | This too is no longer necessary.  It's useful in that it shows what we don't | 
have to put up with.  Missing messages would require potentially painful
infrastructure to handle:
|  |  | 
|  | ----------------------------- | 
|  | All event_q's requesting IPIs ought to register with the 2LS.  This is for | 
|  | recovering in case the vcpd's mbox overflowed, and the vcore knows it missed a | 
|  | NE_EVENT type message.  At that point, it would have to check all of its | 
|  | IPI-based queues.  To do so, it could check to see if the mbox has any | 
|  | messages, though in all likelihood, we'll just act as if there was a message | 
|  | on each of the queues (all such handlers should be able to handle spurious | 
IPIs anyways).  This is analogous to how the OS's block drivers don't solely
|  | rely on receiving an interrupt (they deal with it via timeouts).  Any user | 
|  | code requiring an IPI must do this.  Any code that runs better due to getting | 
|  | the IPI ought to do this. | 
|  |  | 
|  | We could imagine having a thread spawned to handle an ev_q, and the vcore | 
|  | never has to touch the ev_q (which might make it easier for memory | 
|  | allocation).  This isn't a great idea, but I'll still explain it.  In the | 
|  | notif_ev message sent to the vcore, it has the event_q*.  We could also send a | 
|  | flag with the same info as in the event_q's flags, and also send the handler. | 
|  | The problem with this is that it isn't resilient to failure.  If there was a | 
message overflow, it would have to check the event_q (which was registered
|  | before) anyway, and could potentially page fault there.  Also the kernel would | 
|  | have faulted on it (and read it in) back when it tried to read those values. | 
|  | It's somewhat moot, since we're going to have an allocator that pins event_qs. | 
|  | ----------------------------- | 
|  |  | 
|  | 3.6 Round-Robin or Other IPI-delivery styles | 
|  | --------------------------------------- | 
|  | In the same way that the IOAPIC can deliver interrupts to a group of cores, | 
|  | round-robinning between them, so can we imagine processes wanting to | 
|  | distribute the IPI/active notification of events across its vcores.  This is | 
only meaningful if the NOTIF_IPI_WANTED flag is set.
|  |  | 
|  | Eventually we'll support this, via a flag in the event_q.  When | 
NE_ROUND_ROBIN, or whatever, is set, a couple of things will happen.  First, the
|  | vcore field will be used in a "delivery-specific" manner.  In the case of RR, | 
|  | it will probably be the most recent destination.  Perhaps it will be a bitmask | 
|  | of vcores available to receive.  More important is the event_mbox*.  If it is | 
|  | set, then the event message will be sent there.  Whichever vcore gets selected | 
|  | will receive an IPI, and its vcpd mbox will get a NE_EVENT message.  If the | 
|  | event_mbox* is 0, then the actual message will get delivered to the vcore's | 
|  | vcpd mbox (the default location). | 
|  |  | 
|  | 3.7 Event_q-less Notifications | 
|  | --------------------------------------- | 
Some events need to be delivered directly to the vcore, regardless of any
|  | event_qs.  This happens currently when we bypass the notification table (e.g., | 
|  | sys_self_notify(), preemptions, etc).  These notifs will just use the vcore's | 
|  | default mbox.  In essence, the ev_q is being generated/sent with the call. | 
|  | The implied/fake ev_q points to the vcpd's mbox, with the given vcore set, and | 
|  | with IPI_WANTED set.  It is tempting to make those functions take a | 
|  | dynamically generated ev_q, though more likely we'll just use the lower level | 
|  | functions in the kernel, much like the Round Robin set will need to do.  No | 
|  | need to force things to fit just for the sake of using a 'solution'.  We want | 
|  | tools to make solutions, not packaged solutions. | 
|  |  | 
|  | 3.8 UTHREAD_DONT_MIGRATE | 
|  | --------------------------------------- | 
|  | DONT_MIGRATE exists to allow uthreads to disable notifications/IPIs and enter | 
|  | vcore context.  It is needed since you need to read vcoreid to disable notifs, | 
|  | but once you read it, you need to not move to another vcore.  Here are a few | 
|  | rules/guidelines. | 
|  |  | 
|  | We turn off the flag so that we can disable notifs, but turn the flag back on | 
|  | before enabling.  The thread won't get migrated in that instant since notifs are | 
|  | off.  But if it was the other way, we could miss a message (because we skipped | 
|  | an opportunity to be dropped into vcore context to read a message). | 
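
A sketch of the disable side of that sequence (helper and barrier names are
illustrative; see the uthread code for the real thing):

static void uth_disable_notifs_sketch(struct uthread *uth)
{
        uint32_t vcoreid;

        uth->flags |= UTHREAD_DONT_MIGRATE;     /* pin to this vcore... */
        cmb();                  /* ...before reading which vcore we're on */
        vcoreid = vcore_id();
        disable_notifs(vcoreid);
        /* notifs are off, so we can't be moved in this instant: safe to
         * clear the flag.  Per the rule above, set it again before
         * re-enabling notifs on the way back out. */
        uth->flags &= ~UTHREAD_DONT_MIGRATE;
}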
|  |  | 
|  | Don't check messages/handle events when you have a DONT_MIGRATE uthread.  There | 
|  | are issues with preemption recovery if you do.  In short, if two uthreads are | 
|  | both DONT_MIGRATE with notifs enabled on two different vcores, and one vcore | 
|  | gets preempted while the other gets an IPI telling it to recover the other one, | 
|  | both could keep bouncing back and forth if they handle their preemption | 
|  | *messages* without dealing with their own DONT_MIGRATEs first.  Note that the | 
|  | preemption recovery code can handle having a DONT_MIGRATE thread on the vcore. | 
|  | This is a special case, and it is very careful about how cur_uthread works. | 
|  |  | 
|  | All uses of DONT_MIGRATE must reenable notifs (and check messages) at some | 
|  | point.  One such case is uthread_yield().  Another is mcs_unlock_notifsafe(). | 
|  | Note that mcs_notif_safe locks have uthreads that can't migrate for a | 
|  | potentially long time.  notifs are also disabled, so it's not a big deal.  It's | 
|  | basically just the same as if you were in vcore context (though technically you | 
|  | aren't) when it comes to preemption recovery: we'll just need to restart the | 
|  | vcore via a syscall.  Also note that it would be a real pain in the ass to | 
|  | migrate a notif_safe locking uthread.  The whole point of it is in case it grabs | 
|  | a lock that would be held by vcore context, and there's no way to know it isn't | 
|  | a lock on the restart-path. | 
|  |  | 
|  | 3.9 Why Preemption Handling Doesn't Lock Up (probably) | 
|  | --------------------------------------- | 
|  | One of the concerns of preemption handling is that we don't get into some form | 
|  | of livelock, where we ping-pong back and forth between vcores (or a set of | 
|  | vcores), all of which are trying to handle each other's preemptions.  Part of | 
|  | the concern is that when a vcore sys_changes to another, it can result in | 
|  | another preemption message being sent.  We want to be sure that we're making | 
|  | progress, and not just livelocked doing sys_change_vcore()s. | 
|  |  | 
|  | A few notes first: | 
|  | 1) If a vcore is holding locks or otherwise isn't handling events and is | 
|  | preempted, it will let go of its locks before it gets to the point of | 
|  | attempting to handle any other vcore preemption events.  Event handling is only | 
|  | done when it is okay to never return (meaning no locks are held).  If this is | 
|  | the situation, eventually it'll work itself out or get to a potential ping-pong | 
|  | scenario. | 
|  |  | 
|  | 2) When you change_to while handling preemption, once you start back up, you | 
|  | will leave change_to and eventually fetch a new event.  This means any | 
|  | potential ping-pong needs to happen on a fresh event. | 
|  |  | 
|  | 3) If there are enough pcores for the vcores to all run, we won't issue any | 
|  | change_tos, since the vcores are no longer preempted.  This means we only are | 
|  | worried about situations with insufficient vcores.  We'll mostly talk about 1 | 
|  | pcore and 2 vcores. | 
|  |  | 
|  | 4) Preemption handlers will not call change_to on their target vcore if they | 
|  | are also the one STEALING from that vcore.  The handler will stop STEALING | 
|  | first. | 
|  |  | 
|  | So the only way to get stuck permanently is if both cores are stuck doing a | 
|  | sys_change_to(FALSE).  This means we want to become the other vcore, *and* we | 
|  | need to restart our vcore where it left off.  This is due to some invariant | 
|  | that keeps us from abandoning vcore context.  If we were to abandon vcore | 
|  | context (with a sys_change_to(TRUE)), we basically don't need to be | 
|  | preempt-recovered.  We already packaged up our cur_uthread, and we know we | 
|  | aren't holding any locks or otherwise breaking any invariants.  The system will | 
|  | work fine if we never run again.  (Someone just needs to check our messages). | 
|  |  | 
|  | Now, there are only two cases where we will do a sys_change_to(FALSE) *while* | 
|  | handling preemptions.  Again, we aren't concerned about things like MCS-PDR | 
|  | locks; those all work because the change_tos are done where we'd normally just | 
|  | busy loop.  We are only concerned about change_tos during handle_vc_preempt. | 
|  | These two cases are when the changing/handling vcore has a DONT_MIGRATE uthread | 
|  | or when someone else is STEALING its uthread.  Note that both of these cases | 
|  | are about the calling vcore, not its target. | 
|  |  | 
|  | If a vcore (referred to as "us") has a DONT_MIGRATE uthread and it is handling | 
|  | events, it is because someone else is STEALING from our vcore, and we are in | 
|  | the short one-shot event handling loop at the beginning of | 
|  | uthread_vcore_entry().  Whichever vcore is STEALING will quickly realize it | 
|  | can't steal (it sees the DONT_MIGRATE), and bail out.  If that vcore isn't | 
|  | running now, we will change_to it (which is the purpose of our handling their | 
|  | preemption).  Once that vcore realizes it can't steal, it will stop STEALING | 
|  | and change to us.  At this point, no one is STEALING from us, and we move along | 
|  | in the code.  Specifically, we do *not* handle events (we now have an event | 
|  | about the other vcore being preempted when it changed_to us), and instead we | 
|  | start up the DONT_MIGRATE uthread and let it run until it is migratable, at | 
|  | which point we handle events and will deal with the other vcore. | 
|  |  | 
|  | So DONT_MIGRATE will be sorted out.  Likewise, STEALING gets sorted out too, | 
|  | quite easily.  If someone is STEALING from us, they will quickly stop STEALING | 
|  | and change to us.  There are only two ways this could even happen: they are | 
|  | running concurrently with us, and somehow saw us out of vcore context before | 
|  | deciding to STEAL, or they were in the process of STEALING and got preempted by | 
|  | the kernel.  They would not have willingly stopped running while STEALING our | 
|  | cur_uthread.  So if we are running and someone is stealing, after a round of | 
|  | change_tos, eventually they run, and stop STEALING. | 
|  |  | 
|  | Note that once someone stops STEALING from us, they will not start again, | 
|  | unless we leave vcore context.  If that happened, we basically broke out of the | 
|  | ping-pong, and now we're onto another set of preemptions.  We wouldn't leave | 
|  | vcore context if we still had preemption events to deal with. | 
|  |  | 
|  | Finally, note that we needed to only check for one message at a time at the | 
|  | beginning of uthread_vcore_entry().  If we just handled the entire mbox without | 
|  | checking STEALING, then we might not break out of that loop if there is a | 
|  | constant supply of messages (perhaps from a vcore in a similar loop). | 
|  |  | 
|  | Anyway, that's the basic plan behind the preemption handler and how we avoid | 
|  | the ping-ponging.  change_to_vcore() is built so that we handle our own | 
|  | preemption before changing (pack up our current uthread), so that we make | 
|  | progress.  The two cases where we can't do that get sorted out after everyone | 
gets to run once, and since you can't steal or have other uthreads turn on
|  | DONT_MIGRATE while we're in vcore context, eventually we clear everything up. | 
|  | There might be other bugs or weird corner cases, possibly involving multiple | 
|  | vcores, but I think we're okay for now. | 
|  |  | 
|  | 3.10: Handling Messages for Other Vcores | 
|  | --------------------------------------- | 
|  | First, remember that when a vcore handles an event, there's no guarantee that | 
|  | the vcore will return from the handler.  It may start fresh in vcore_entry(). | 
|  |  | 
|  | The issue is that when you handle another vcore's INDIRs, you may handle | 
|  | preemption messages.  If you have to do a change_to, the kernel will make sure | 
|  | a message goes out about your demise.  Thus someone who recovers that will | 
|  | check your public mbox.  However, the recoverer won't know that you were | 
|  | working on another vcore's mbox, so those messages might never be checked. | 
|  |  | 
|  | The way around it is to send yourself a "check the other guy's messages" event. | 
|  | When we might change_to and never return, if we were dealing with another | 
vcore's mbox, we'll send ourselves a message to finish up that mbox (if there
|  | are any messages left).  Whoever reads our messages will eventually get that | 
|  | message, and deal with it. | 
|  |  | 
|  | One thing that is a little ugly is that the way you deal with messages two | 
|  | layers deep is to send yourself the message.  So if VC1 is handling VC2's | 
|  | messages, and then wants to change_to VC3, VC1 sends a message to VC1 to check | 
|  | VC2.  Later, when VC3 is checking VC1's messages, it'll handle the "check VC2's messages" | 
|  | message.  VC3 can't directly handle VC2's messages, since it could run a | 
|  | handler that doesn't return.  Nor can we just forget about VC2.  So VC3 sends | 
|  | itself a message to check VC2 later.  Alternatively, VC3 could send itself a | 
|  | message to continue checking VC1, and then move on to VC2.  Both seem | 
|  | equivalent.  In either case, we ought to check to make sure the mbox has | 
|  | something before bothering sending the message. | 
|  |  | 
|  | So for either a "change_to that might not return" or for a "check INDIRs on yet | 
|  | another vcore", we send messages to ourself so that we or someone else will | 
|  | deal with it. | 
|  |  | 
|  | Note that we use TLS to track whether or not we are handling another vcore's | 
|  | messages, and if we do plan to change_to that might not return, we clear the | 
|  | bool so that when our vcore starts over at vcore_entry(), it starts over and | 
isn't still checking someone else's messages.
|  |  | 
|  | As a reminder of why this is important: these messages we are hunting down | 
|  | include INDIRs, specifically ones to ev_qs such as the "syscall completed | 
|  | ev_q".  If we never get that message, a uthread will block forever.  If we | 
|  | accidentally yield a vcore instead of checking that message, we would end up | 
|  | yielding the process forever since that uthread will eventually be the last | 
|  | one, but our main thread is probably blocked on a join call.  Our process is | 
|  | blocked on a message that already came, but we just missed it. | 
|  |  | 
|  | 4. Single-core Process (SCP) Events: | 
|  | ==================== | 
|  | 4.1 Basics: | 
|  | --------------------------------------- | 
|  | Event delivery is important for SCP's blocking syscalls.  It can also be used | 
|  | (in the future) to deliver POSIX signals, which would just be another kernel | 
|  | event. | 
|  |  | 
|  | SCPs can receive events just like MCPs.  For the most part, the code paths are | 
the same on both sides of the K/U interface.  The kernel sends events (it can
detect an SCP and will send them to vcore0) and makes sure you can't yield and
miss an event, etc.  Userspace preps vcore context in advance, and can do all
the things vcore context does: handle events, select a thread to run.
|  | For an SCP, there is only one thread to run. | 
|  |  | 
|  | 4.2 Degenerate Event Delivery: | 
|  | --------------------------------------- | 
|  | That being said, there are a few tricky things.  First, there is a time before | 
|  | the SCP is ready to fully receive events.  Specifically, before | 
|  | vcore_event_init(), which is called out of glibc's _start.  More importantly, | 
|  | the runtime linker never calls that function, yet it wants to block. | 
|  |  | 
|  | The important thing to note is that there are a few parts to event delivery: | 
|  | registration (user), sending the event (kernel), making sure the proc wakes up | 
|  | (kernel), and actually handling the event (user).  For syscalls, the only thing | 
|  | the process (even rtld) needs is the first three.  Registration is easy - can be | 
|  | done with nothing more than kernel headers (no need for parlib) for NO_MSG ev_qs | 
|  | (no need to init the UCQ).  Event handling is trickier, and requires parlib | 
|  | (which rtld can't link against).  To support processes that could register for | 
|  | events, but not handle them (or even enter vcore context), the kernel needed a | 
|  | few changes (checking the VC_SCP_NOVCCTX flag) so that it would wake the | 
|  | process, but never put it in vcore context. | 
|  |  | 
|  | This degenerate event handling just wakes the process up, at which point it can | 
|  | check on its syscall.  Very early in the process's life, it'll init vcore0's | 
|  | UCQ and be able to handle full events, enter vcore context, etc. | 
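
As a sketch of that degenerate, rtld-style registration (flag names like
EVENT_NO_MSG and the field names are illustrative of the idea, not the exact
ABI):

static struct ev_queue scp_wakeup_evq;  /* static: no allocator needed yet */

static void scp_block_on_sysc(struct syscall *sysc)
{
        scp_wakeup_evq.ev_flags = EVENT_IPI | EVENT_NO_MSG; /* no UCQ to init */
        scp_wakeup_evq.ev_vcore = 0;            /* SCPs live on vcore 0 */
        sysc->ev_q = &scp_wakeup_evq;
        /* the kernel will wake us but never drop us into vcore context
         * (VC_SCP_NOVCCTX); when we run again, just re-check the syscall */
        while (!(sysc->flags & SC_DONE))
                sys_yield(FALSE);       /* "not nice": waiting on an event */
}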
|  |  | 
|  | Once the SCP is up and running, it can receive events like normal.  One thing to | 
|  | note is that the SCPs are not using a handle_syscall() event handler, like the | 
|  | MCPs do.  They are only using the event to get the process restarted, at which | 
|  | point their vcore 0 restarts thread0.  One consequence of this is that if a | 
|  | process receives an unrelated event while blocking on a syscall, it'll handle | 
|  | that event, then restart thread0.  Thread0 will see its syscall isn't complete, | 
|  | and then re-block.  (It also re-registers its ev_q, which is harmless).  When | 
|  | that syscall is finally done, the kernel will send an event and wake it up | 
|  | again. | 
|  |  | 
|  | 4.3 Extra Tidbits: | 
|  | --------------------------------------- | 
|  | If we receive an event right as we transition from SCP to MCP, vcore0 could get | 
|  | spammed with a message that is never received.  Right now, it's not a problem, | 
|  | since vcore0 is the first vcore that will get woken up as an MCP.  This could be | 
|  | an issue if we ever allow transitions from MCP back to SCP. | 
|  |  | 
|  | On a related note, it's now wrong for SCPs to sys_yield(FALSE) (not being nice, | 
|  | meaning they are waiting for an event) in a loop that does not check events or | 
|  | otherwise allow them to break out of that loop.  This should be fairly obvious. | 
|  | A little more subtle is that these loops also need to sort out notif_pending. | 
|  | If you are trying to yield and still have an old notif_pending set, the kernel | 
|  | won't let you yield (it thinks you are missing the notif).  For the degenerate | 
mode (VC_SCP_NOVCCTX is set on vcore0), the kernel will handle dealing with
|  | this flag. | 
|  |  | 
|  | Finally, note that while the SCP is in vcore context, it has none of the | 
|  | guarantees of an MCP.  It's somewhat meaningless to talk about being gang | 
|  | scheduled or knowing about the state of other vcores.  If you're running, you're | 
|  | on a physical core.  You may get unexpected interrupts, descheduled, etc.  Aside | 
|  | from the guarantees and being the only vcore, the main differences are really up | 
|  | to the kernel scheduler.  In that sense, we have somewhat of a new state for | 
|  | processes - SCPs that can enter vcore context.  From the user's perspective, | 
|  | they look a lot like an MCP, and the degenerate/early mode SCPs are like the | 
|  | old, dumb SCPs.  The big difference for userspace is that there isn't a 2LS yet | 
|  | (will need to reinit things slightly).  The kernel treats SCPs and MCPs very | 
|  | differently too, but that may not always be the case. | 
|  |  | 
|  | 5. Misc Things That Aren't Sorted Completely: | 
|  | ==================== | 
|  | 5.1 What about short handlers? | 
|  | --------------------------------------- | 
|  | Once we sort the other issues, we can ask for them via a flag in the event_q, | 
|  | and run the handler in the event_q struct. | 
|  |  | 
|  | 5.2 What about blocking on a syscall? | 
|  | --------------------------------------- | 
|  | The current plan is to set a flag, and let the kernel go from there.  The | 
|  | kernel knows which process it is, since that info is saved in the kthread that | 
|  | blocked.  One issue is that the process could muck with that flag and then go | 
|  | to sleep forever.  To deal with that, maybe we'd have a long running timer to | 
|  | reap those.  Arguably, it's like having a process while(1).  You can screw | 
|  | yourself, etc.  Killing the process would still work. |