| async_events.txt |
| Barret Rhoden |
| |
| 1. Overview |
| 2. Async Syscalls and I/O |
| 3. Event Delivery / Notification |
| 4. Single-core Process (SCP) Events |
| 5. Misc Things That Aren't Sorted Completely: |
| |
| 1. Overview |
| ==================== |
| 1.1 Event Handling / Notifications / Async IO Issues: |
| ------------------------------------------------------------------ |
| Basically, syscalls use the ROS event delivery mechanisms, redefined and |
described below. Syscalls use event delivery just like any other subsystem
that wants to deliver messages to a process. The only other example we have
right now is the "kernel notifications": the one-sided, kernel-initiated
messages that the kernel sends to a process.
| |
| Overall, there are several analogies from how vcores work to how the OS |
| handles interrupts. This is a result of trying to make vcores run like |
| virtual multiprocessors, in control of their resources and aware of the lower |
| levels of the system. This analogy has guided much of how the vcore layer |
works. Whenever we have issues with the 2LS, realize that the amount of
control it wants means using solutions similar to those the OS must use.
| |
| Note that there is some pointer chasing going on, though we try to keep it to |
| a minimum. Any time the kernel chases a pointer, it needs to make sure it is |
| in the R/W section of userspace, though it doesn't need to check if the page |
| is present. There's more info in the Page Fault sections of the |
| documentation. (Briefly, if the kernel PFs on a user address, it will either |
| block and handle the PF, or if the address was unmapped, it will kill the |
| process). |
| |
| 1.2 Some Definitions: |
| --------------------------------------- |
| ev_q, event_queue, event_q: all terms used interchangeably with each other. |
| They are the endpoint for communicating messages to a process, encapsulating |
| the method of delivery (such as IPI or not) with where to save the message. |
| |
| Vcore context: the execution context of the virtual core on the "trampoline" |
| stack. All executions start from the top of this stack, and no stack state is |
| saved between vcore_entry() calls. All executions on here are non-blocking, |
| notifications (IPIs) are disabled, and there is a specific TLS loaded. Vcore |
| context is used for running the second level scheduler (2LS), swapping between |
threads, and handling notifications. It is analogous to "interrupt context"
| in the OS. Any functions called from here should be brief. Any memory |
| touched must be pinned. In Lithe terms, vcore context might be called the |
| Hart / hard thread. People often wonder if they can run out of vcore context |
| directly. Technically, you can, but you lose the ability to take any fault |
| (page fault) or to get IPIs for notification. In essence, you lose control, |
analogous to running an application in the kernel with preemption/interrupts
| disabled. See the process documentation for more info. |
| |
| 2LS: is the second level scheduler/framework. This code executes in vcore |
| context, and is Lithe / plugs in to Lithe (eventually). Often used |
| interchangeably with "vcore context", usually when I want to emphasize the |
| scheduling nature of the code. |
| |
| VCPD: "virtual core preemption data". In procdata, there is an array of |
| struct preempt_data, one per vcore. This is the default location to look for |
| all things related to the management of vcores, such as its event_mbox (queue |
| of incoming messages/notifications/events). Both the kernel and the vcore |
| code know to look here for a variety of things. |
| |
| Vcore-business: This is a term I use for a class of messages where the receiver |
| is the actual vcore, and not just using the vcore as a place to receive the |
| message. Examples of vcore-business are INDIR events, preempt_pending events, |
| scheduling events (self-ipis by the 2LS from one vcore to another), and things |
| like that. There are two types: public and private. Private will only be |
| handled by that vcore. Public might be handled by another vcore. |
| |
| Notif_table: This is a list of event_q*s that correspond to certain |
| unexpected/"one-sided" events the kernel sends to the process. It is similar |
| to an IRQ table in the kernel. Each event_q tells the kernel how the process |
| wants to be told about the specific event type. |
| |
| Notifications: used to be a generic event, but now used in terms of the verb |
| 'notify' (do_notify()). In older docs, passive notification is just writing a |
| message somewhere. Active notification is an IPI delivered to a vcore. I use |
| that term interchangeably with an IPI, and usually you can tell by context |
| that I'm talking about an IPI going to a process (and not just the kernel). |
| The details of it make it more complicated than just an IPI, but it's |
analogous. I've started referring to notification as the IPI, and "passive
| notification" as just events, though older documentation has both meanings. |
| |
| BCQ: "bounded concurrent queue". It is a fixed size array of messages |
| (structs of notification events, or whatever). It is non-blocking, supporting |
| multiple producers and consumers, where the producers do not trust the |
| consumers. It is the primary mechanism for the kernel delivering message |
| payloads into a process's address space. Note that producers don't trust each |
| other either (in the event of weirdness, the producers give up and say the |
| buffer is full). This means that a process can produce for one of its ev_qs |
(which is what it needs to do to send a message to itself).
| |
| UCQ: "unbounded concurrent queue". This is a data structure allowing the kernel |
| to produce an unbounded number of messages for the process to consume. The main |
| limitation to the number of messages is RAM. Check out its documentation. |
| |
| 2. Async Syscalls and I/O |
| ==================== |
| 2.1 Basics |
| ---------------------------------------------- |
| The syscall struct is the contract for work with the kernel, including async |
| I/O. Lots of current OS async packages use epoll or other polling systems. |
| Note the distinction between Polling and Async I/O. Polling is about finding |
| out if a call will block. It is primarily used for sockets and pipes. It |
| does relatively nothing for disk I/O, which requires a separate async I/O |
| system. By having all syscalls be async, we can make polling a bit easier and |
| more unified with the generic event code that we use for all syscalls. |
| |
| For instance, we can have a sys_poll syscall, which is async just like any |
other syscall. The call can be "one shot / non-blocking", like current
systems' polling code, or it can notify on change (not requiring future
| polls) via the event_q mechanisms. If you don't want to be IPId, you can |
| "poll" the syscall struct - not requiring another kernel crossing/syscall. |
| |
| Note that we do not tie syscalls and polling to FDs. We do events on |
| syscalls, which can be used to check FDs. I think a bunch of polling cases |
| will not be needed once we have async syscalls, but for those that remain, |
| we'll have sys_poll() (or whatever). |
| |
| To receive an event on a syscall completion or status change, just fill in the |
| event_q pointer. If it is 0, the kernel will assume you poll the actual |
| syscall struct. |
| |
struct syscall {
    current stuff               /* arguments, retvals */
    struct ev_queue *ev_q;      /* struct used for messaging, including IPIs */
    void *u_data;               /* used by 2LS, usually a struct u_thread * */
}
| |
| One issue with async syscalls is that there can be too many outstanding IOs |
| (normally sync calls provide feedback / don't allow you to over-request). |
| Eventually, processes can exhaust kernel memory (the kthreads, specifically). |
| We need a way to limit the kthreads per proc, etc. Shouldn't be a big deal. |
| |
| Normally, we talk about changing the flag in a syscall to SC_DONE. Async |
| syscalls can be SC_PROGRESS (new stuff happened on it), which can trigger a |
| notification event. Some calls, like AIO or bulk accept, exist for a while |
| and slowly get filled in / completed. In the future, we'll also want a way to |
| abort the in-progress syscalls (possibly any syscall!). |
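
If you don't fill in an ev_q, polling the syscall struct looks roughly like
this (a sketch, given a struct syscall *sysc; atomic_read() stands in for
whatever atomic accessor is used):

    /* Spin until the kernel marks the syscall done */
    while (!(atomic_read(&sysc->flags) & SC_DONE))
        cpu_relax();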
| |
| 2.2 Uthreads Blocking on Syscalls |
| ---------------------------------------------- |
| Many threading libraries will want some notion of a synchronous, blocking |
| thread. These threads use regular I/O calls, which are async under the hood, |
| but don't want to bother with call backs or other details of async I/O. In |
| this section, I'll talk a bit about how this works, esp regarding |
| uthreads/pthreads. |
| |
| 'Blocking' refers to user threads, and has nothing to do with an actual |
| process blocking/waiting on some kernel event. The kernel does not know |
| anything about what goes on here. While a bit confusing, this allows |
| applications to do whatever they want on top of an async interface, and is a |
| consequence of decoupling cores from user-threads from kthreads. |
| |
| 2.2.1 Basics of Uthread Blocking |
| --------------- |
| When a thread calls a glibc function that makes a system call, if the syscall |
| is not yet complete when the kernel returns to userspace, glibc will check for |
| the existence of a second level scheduler and attempt to use it to yield its |
| uthread. If there is no 2LS, the code just spins for now. Eventually, it |
| will try to suspend/yield the process for a while (til the call is done), aka, |
| block in the kernel. |
| |
| If there is a 2LS, the current thread will yield, and call out to the 2LS's |
| blockon_sysc() method, which needs a way to stop the thread and be able to |
| restart it when the syscall completes. Specifically, the pthread 2LS registers |
| the syscall to respond to an event (described in detail elsewhere in this doc). |
| When the event comes in, meaning the syscall is complete, the thread is put on |
| the runnable list. |
| |
| Details: |
| - A pointer to the struct pthread is stored in the syscall's void*. When the |
| syscall is done, we normally get a message from the kernel, and the payload |
| tells us the syscall is done, which tells us which thread to unblock. |
| - The pthread code also always asks for an IPI and event message for every |
| syscall that completes. This is far from ideal. Still, the basics are the |
| same for any threading library. Once you know a thread is done, you need to |
| do something about it. |
| - The pthread code does syscall blocking and event notification on a per-core |
| basis. Using the default (VCPD) ev_mbox for this is a bad idea (which we did |
| at some point). |
| - There's a race between the 2LS trying to sign up for events and the kernel |
finishing the event. We handle this in uthread code, so use the register_evq()
helper, which does the right thing (atomics, careful ordering with writes,
etc); see the sketch below.
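
Putting those details together, a sketch of the pthread-style blockon path
(names like pth_blockon_sysc() and PTH_BLK_SYSC are illustrative; sysc_evq is
the per-core ev_q described above, and per this doc, registering can fail if
the syscall already completed):

    static void pth_blockon_sysc(struct uthread *uth, struct syscall *sysc)
    {
        struct pthread *pth = (struct pthread*)uth;

        pth->state = PTH_BLK_SYSC;      /* hypothetical state */
        sysc->u_data = uth;             /* the event handler finds us here */
        /* register_evq() can lose the race with completion */
        if (!register_evq(sysc, sysc_evq))
            pth_thread_runnable(uth);   /* already done: requeue it now */
    }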
| |
2.2.2 Recovering from Event Overflow
| --------------- |
Event overflow recovery is unnecessary, since syscall ev_qs use UCQs now. This
| section is kept around for some useful tidbits, such as details about |
| deregistering ev_qs for a syscall: |
| |
| --------------------------- |
| The pthread code expects to receive an event somehow to unblock a thread |
| once its syscall is done. One limitation to our messaging systems is that you |
| can't send an infinite amount of event messages. (By messages, I mean a chunk |
| of memory with a payload, in this case consisting of a struct syscall *). |
Event delivery degrades to just setting a bit in the case of the message queue
being full (more details on that later).
| |
The pthread code (and any similar 2LS) needs to handle waking up threads when
an event message was lost: all we know is that some syscall was supposed to
have a message sent to a particular event queue (per-core in the case of the
pthread code; actually the VCPD for now). The basic idea is to poll all
outstanding system calls and unblock whoever is done.
| |
The key problem is due to a race: for a given syscall, we don't know if we're
going to get a message for it or not.
| message in the queue for the syscall while we are going through the list of |
| blocked threads. If we assume we already got the message (or it was lost in |
the overflow), but didn't really, then if we finish an SC and free its memory
| (free or return up the stack), we could later get a message for it, and all |
| sorts of things would go wrong (like trying to unblock a pointer that is |
| gibberish). |
| |
| Here's what we do: |
| 1) Set a "handling overflow" flag so we don't recurse. |
2) Turn off event delivery for all syscalls on our list.
| 3) Handle any event messages. This is how we make a distinction between |
| finished syscalls that had a message sent and those that didn't. We're doing |
| the message-sent ones here. |
| 4) For any left on the list, check to see if they are done. We actually do |
| this by attempting to turn on event delivery for them. Turning on event |
| delivery can fail if the call is already done. So if it fails, they are done |
| and we unblock them (similar to how we block the threads in the first place). |
| If it doesn't fail, they are now ready to receive messages. This can be |
| tweaked a bit. |
| 5) Unset the overflow-handling flag. |
| |
One thing to be careful of is that when we turn off event delivery, we need to
| be sure the kernel isn't in the process of sending an event. This is why we |
| have the SC_K_LOCK syscall flag. Uthread code will not consider deregistration |
| complete while that flag is set, since the kernel is still mucking with the |
| syscall (and sending an event). Once the flag is clear, the event has been |
| delivered (the ev_msg is in the ev_mbox), and our assumptions remain true. |
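
A minimal sketch of that deregistration spin, given a struct syscall *sysc
(SC_UEVENT is assumed here as the 'event delivery armed' flag; the atomic
helpers are illustrative):

    /* Turn off event delivery, then wait out the kernel's critical section */
    atomic_and(&sysc->flags, ~SC_UEVENT);       /* assumed flag name */
    while (atomic_read(&sysc->flags) & SC_K_LOCK)
        cpu_relax();            /* kernel is still sending the event */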
| |
| There are a couple implications of this style. If you have a shared event |
| queue (with other event sources), those events can get mixed in with the |
| recovery. Don't leave the vcore context due to other events. This'll |
| probably need work. The other thing is that completed syscalls can get |
| handled in a different order than they were signaled. Shouldn't be a big |
| deal. |
| |
| Note on the overflow handling flag and unsetting it. There should not be any |
| races with this. The flag prevented us from handling overflows on the event |
queue. Other than when we checked for events that had been successfully sent,
| we didn't try to handle events. We can unset the flag, and at that point we |
| can start handling missed events. If there was an overflow after we last |
| checked the list, but before we cleared the overflow-handling flag, we'll |
| still catch it since we haven't tried handling events in between checking the |
| list and clearing the flag. That flag doesn't even matter until we want to |
handle_events, so we aren't missing anything. The next handle_events() will
| deal with everything from scratch. |
| |
| For blocking threads that block concurrently with the overflow handling: in |
| the pthread case, this can't happen since everything is per-vcore. If you do |
| have process-wide thread blocking/syscall management, we can add new ones, but |
| they must have event delivery turned off when they are added to the list. And |
| you'll need to lock the list, etc. This should work in part due to new |
| syscalls being added to the end of the list, and the overflow-handler |
| proceeding linearly through the list. |
| |
| Also note that we shouldn't handle the event for unblocking a syscall on a |
| different core than the one it was submitted to. This could result in |
| concurrent modifications to the original core's TAILQ (bad). This restriction |
| is dependent on how a 2LS does its thread handling/blocking. |
| |
| Eventually, we'll want a way to detect and handle excessive overflow, since |
| it's probably quite expensive. Perhaps turn it off and periodically poll the |
| syscalls for completion (but don't bother turning on the ev_q). |
| --------------------------- |
| |
| 3. Event Delivery / Notification |
| ==================== |
| 3.1 Basics |
| ---------------------------------------------- |
| The mbox (mailbox) is where the actual messages go. |
| |
struct ev_mbox {
    bcq of notif_events         /* bounded buffer, multi-consumer/producer */
    msg_bitmap
}
struct ev_queue {               /* aka, event_q, ev_q, etc. */
    struct ev_mbox *
    void (*handler)(struct ev_queue *)
    vcore_to_be_told
    flags                       /* IPI_WANTED, RR, 2L-handle-it, etc */
}
struct ev_queue_big {
    struct ev_mbox *            /* pointing to the internal storage */
    vcore_to_be_told
    flags                       /* IPI_WANTED, RR, 2L-handle-it, etc */
    struct ev_mbox { }          /* never access this directly */
}
| |
| The purpose of the big one is to simply embed some storage. Still, only |
access the mbox via the pointer. The big one can be cast to (and stored as)
the regular one, so long as you know to dealloc it as a big one (free() knows;
custom styles or slabs would need some help).
| |
| The ev_mbox says where to put the actual message, and the flags handle things |
| such as whether or not an IPI is wanted. |
| |
| Using pointers for the ev_q like this allows multiple event queues to use the |
| same mbox. For example, we could use the vcpd queue for both kernel-generated |
| events as well as async syscall responses. The notification table is actually |
| a bunch of ev_qs, many of which could be pointing to the same vcore/vcpd-mbox, |
| albeit with different flags. |
| |
| 3.2 Kernel Notification Using Event Queues |
| ---------------------------------------------- |
| The notif_tbl/notif_methods (kernel-generated 'one-sided' events) is just an |
| array of struct ev_queue*s. Handling a notification is like any other time |
| when we want to send an event. Follow a pointer, send a message, etc. As |
| with all ev_qs, ev_mbox* points to where you want the message for the event, |
| which usually is the vcpd's mbox. If the ev_q pointer is 0, then we know the |
| process doesn't want the event (equivalent to the older 'NOTIF_WANTED' flag). |
| Theoretically, we can send kernel notifs to user threads. While it isn't |
| clear that anyone will ever want this, it is possible (barring other issues), |
| since they are just events. |
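
In rough kernel-side terms, the dispatch is something like this (a sketch; the
table layout and the send_event() helper are illustrative):

    /* Does the process want events of type ev_type? */
    struct ev_queue *ev_q = p->procdata->notif_tbl[ev_type];  /* assumed layout */
    if (!ev_q)
        return;                 /* ev_q* of 0: event not wanted */
    send_event(p, ev_q, &ev_msg, 0);    /* follow pointers, post msg, maybe IPI */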
| |
| Also note the flag EVENT_VCORE_APPRO. Processes should set this for certain |
| types of events where they want the kernel to send the event/IPI to the |
| 'appropriate' vcore. For example, when sending a message about a preemption |
| coming in, it makes sense for the kernel to send it to the vcore that is going |
| to get preempted, but the application could choose to ignore the notification. |
| When this flag is set, the kernel will also use the vcore's ev_mbox, ignoring |
| the process's choice. We can change this later, but it doesn't really make |
| sense for a process to pick an mbox and also say VCORE_APPRO. |
| |
There are also interfaces in the kernel to put a message in an ev_mbox
regardless of the process's wishes (post_vcore_event()) and to send an IPI
at any time (proc_notify()).
| |
| 3.3 IPIs, Indirection Events, and Fallback (Spamming Indirs) |
| ---------------------------------------------- |
An ev_q can ask for an IPI, for an indirection event, and for an indirection
event to be spammed in case a vcore is offline (sometimes called the 'fallback'
option), or any combination of these. Note that these have little to do with
| the actual message being sent. The actual message is dropped in the ev_mbox |
| pointed to by the ev_q. |
| |
| The main use for all of this is for syscalls. If you want to receive an event |
| when a syscall completes or has a change in status, simply allocate an event_q, |
| and point the syscall at it. syscall: ev_q* -> "vcore for IPI, syscall message |
| in the ev_q mbox", etc. You can also point it to an existing ev_q. Pthread |
code has examples of two ways to do this. Both have per-vcore ev_qs, requesting
IPIs, INDIRs, and SPAM_INDIR. One way is to have an ev_mbox per vcore, and
| another is to have a global ev_mbox that all ev_qs point to. As a side note, if |
| you do the latter, you don't need to worry about a vcore's ev_q if it gets |
| preempted: just check the global ev_mbox (which is done by checking your own |
| vcore's syscall ev_q). |
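
A sketch of the global-mbox style (the init/alloc helpers, the MAX_VCORES
constant, and the field names are illustrative):

    /* One shared mbox; per-vcore ev_qs all point at it */
    struct ev_mbox *sysc_mbox = calloc(1, sizeof(struct ev_mbox));
    struct ev_queue *sysc_evq[MAX_VCORES];

    mbox_init(sysc_mbox);               /* hypothetical: set up UCQ pages */
    for (int i = 0; i < max_vcores(); i++) {
        sysc_evq[i] = get_eventq();     /* hypothetical allocator */
        sysc_evq[i]->ev_mbox = sysc_mbox;
        sysc_evq[i]->ev_flags = EVENT_IPI | EVENT_INDIR | EVENT_SPAM_INDIR;
        sysc_evq[i]->ev_vcore = i;      /* whom to IPI / send INDIRs to */
    }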
| |
| 3.3.1: IPIs and INDIRs |
| --------------- |
| An EVENT_IPI simply means we'll send an IPI to the given vcore. Nothing else. |
| This will usually be paired with an Indirection event (EVENT_INDIR). An INDIR |
| is a message of type EV_EVENT with an ev_q* payload. It means "check this |
| ev_q". Most ev_qs that ask for an IPI will also want an INDIR so that the vcore |
| knows why it was IPIed. You don't have to do this: for instance, your 2LS might |
| poll its own ev_q, so you won't need the indirection event. |
| |
| Additionally, note that IPIs and INDIRs can be spurious. It's not a big deal to |
receive an IPI and have nothing to do, or to be told to check an empty ev_q.
| All of the event handling code can deal with this. |
| |
| INDIR events are sent to the VCPD public mbox, which means they will get handled |
| if the vcore gets preempted. Any other messages sent here will also get handled |
| during a preemption. However, the only type of messages you should use this for |
| are ones that can handle spurious messages. The completion of a syscall is an |
| example of a message that cannot be spurious. Since INDIRs can be spurious, we |
| can use the public mbox. (Side note: the kernel may spam INDIRs in attempting |
| to make sure you get the message on a vcore that didn't yield.) |
| |
| Never use a VCPD mbox (public or private) for messages you might want to receive |
| if that vcore is offline. If you want to be sure to get a message, create your |
| own ev_q and set flags for INDIR, SPAM_INDIR, and IPI. There's no guarantee a |
| *specific* message will get looked at. In cases where it won't, the kernel will |
| send that message to another vcore. For example, if the kernel posts an INDIR |
| to a VCPD mbox (the public one btw) and it loses a race with the vcore yielding, |
| the vcore might never see that message. However, the kernel knows it lost the |
| race, and will find another vcore to send it to. |
| |
| 3.3.2: Spamming Indirs / Fallback |
| --------------- |
| Both IPI and INDIR need an actual vcore. If that vcore is unavailable and if |
| EVENT_SPAM_INDIR is set, the kernel will pick another vcore and send the |
| messages there. This allows an ev_q to be set up to handle work when the vcore |
| is online, while allowing the program to handle events when that core yields, |
| without having to reset all of its ev_qs to point to "known" available vcores |
| (and avoiding those races). Note 'online' is synonymous with 'mapped', when |
| talking about vcores. A vcore technically isn't always online, only destined |
| to be online, when it is mapped to a pcore (kmsg on the way, etc). It's |
| easiest to think of it being online for the sake of this discussion. |
| |
| One question is whether or not 2LSs need a SPAM_INDIR flag for their ev_qs. |
The main use for SPAM_INDIR is so that vcores can yield. (Note that spamming
won't keep you from *missing* INDIR messages in the event of a preemption; you
can always lose that race due to it taking too long to process the messages). An
| alternative would be for vcores to pick another vcore and change all of its |
| ev_qs to that vcore. There are a couple problems with this. One is that it'll |
| be a pain to get those ev_qs back when the vcore comes back online (if ever). |
| Another issue is that other vcores will build up a list of ev_qs that they |
| aren't aware of, which will be hard to deal with when *they* yield. SPAM_INDIR |
| avoids all of those problems. |
| |
| An important aspect of spamming indirs is that it works with yielded vcores, |
| not preempted vcores. It could be that there are no cores that are online, but |
| there should always be at least one core that *will* be online in the future, a |
| core that the process didn't want to lose and will deal with in the future. If |
| not for this distinction, SPAM_INDIR could fail. An older idea would be to have |
| fallback send the msg to the desired vcore if there were no others. This would |
| not work if the vcore yielded and then the entire process was preempted or |
| otherwise not running. Another way to put this is that we need a field to |
| determine whether a vcore is offline temporarily or permanently. |
| |
| This is why we have the VCPD flag 'VC_CAN_RCV_MSG'. It tells the kernel's event |
| delivery code that the vcore will check the messages: it is an acceptable |
| destination for a spammed indir. There are two reasons to put this in VCPD: |
| 1) Userspace can remotely turn off a vcore's msg reception. This is necessary |
| for handling preemption of a vcore that was in uthread context, so that we can |
| remotely 'yield' the core without having to sys_change_vcore() (which I discuss |
| below, and is meant to 'unstick' a vcore). |
| 2) Yield is simplified. The kernel no longer races with itself nor has to worry |
about turning off that flag - userspace can do it when it wants to yield (turn
off the flag, check messages, then yield). This is less of a big deal now that
the kernel races with vcore membership in the online_vcs list.
| |
| Two aspects of the code make this work nicely. The VC_CAN_RCV_MSG flag greatly |
| simplifies the kernel's job. There are a lot of weird races we'd have to deal |
| with, such as process state (RUNNING_M), whether a mass preempt is going on, or |
| just one core, or a bunch of cores, mass yields, etc. A flag that does one |
| thing well helps a lot - esp since preemption is not the same as yielding. The |
| other useful thing is being able to handle spurious events. Vcore code can |
| handle extra IPIs and INDIRs to non-VCPD ev_qs. Any vcore can handle an ev_q |
| that is "non-VCPD business". |
| |
| Worth mentioning is the difference between 'notif_pending' and VC_CAN_RCV_MSG. |
| VC_CAN_RCV_MSG is the process saying it will check for messages. |
| 'notif_pending' is when the kernel says it *has* sent a message. |
| 'notif_pending' is also used by the kernel in proc_yield() and the 2LS in |
| pop_user_ctx() to make sure the sent message is not missed. |
| |
| Also, in case this comes up, there's a slight race on changing the mbox* and the |
| vcore number within the event_q. The message could have gone to the wrong (old) |
| vcore, but not the IPI. Not a big deal - IPIs can be spurious, and the other |
vcore will eventually get it. The real way around this is to create a new ev_q and
| change the pointer (thus atomically changing the entire ev_q's contents), though |
| this can be a bit tricky if you have multiple places pointing to the same ev_q |
| (can't change them all at once). |
| |
| 3.3.3: Fallback and Preemption |
| --------------- |
| SPAM_INDIR doesn't protect you from preemptions. A vcore can be preempted and |
| have INDIRs in its VCPD. |
| |
| It is tempting to just use sys_change_vcore(), which will change the calling |
| vcore to the new one. This should only be used to "unstick" a vcore. A vcore |
| is stuck when it was preempted while it had notifications disabled. This is |
usually when it is in vcore context, but also in any lock-holding code for locks
| shared with vcore context (the userspace equivalent of irqsave locks). With |
| this syscall, you could change to the offline vcore and process its INDIRs. |
| |
| The problem with that plan is the calling core (that is trying to save the |
| other) may have extra messages, and that sys_change_vcore does not return. We |
| need a way to deal with our other messages. We're back to the same problem we |
| had before, just with different vcores. The only thing we really accomplished |
| is that we unstuck the other vcore. We could tell the restarted vcore (via an |
| event) to switch back to us, but by the time it does that, it may have other |
| events that got lost. So we're back to polling the ev_qs that it might have |
| received INDIRs about. Note that we still want to send an event with |
| sys_change_vcore(). We want the new vcore to know the old vcore was put |
| offline: a preemption (albeit one that it chose to do, and one that isn't stuck |
| in vcore context). |
| |
One older way to handle this was to force the 2LS to deal with it. The 2LS
would check the ev_mboxes of all ev_qs that could send INDIRs to the
offline vcore. There could be INDIRs in the VCPD that are just lying there.
| The 2LS knows which ev_qs these are (such as for completed syscalls), and for |
| many things, this will be a common ev_q (such as for 'vcore-x-was-preempted'). |
| However, this is a huge pain in the ass, since a preempted vcore could have the |
| spammed INDIR for an ev_q associated with another vcore. To deal with this, |
| the 2LS would need to check *every* ev_q that requests INDIRs. We don't do |
| this. |
| |
| Instead, we simply have the remote core check the VCPD public mbox of the |
| preempted vcore. INDIRs (and other vcore business that other vcores can handle) |
| will get sorted here. |
| |
3.3.4: Lists to Find Vcores
| --------------- |
| A process has three lists: online, bulk_preempt, and inactive. These not only |
| are good for process management, but also for helping alert_vcore() find |
| potentially alertable vcores. alert_vcore() and its associated helpers are |
fairly complicated and heavily commented. I've set things up so both the
| online_vcs and the bulk_preempted_vcs lists can be handled the same way: post to |
the first element, then see if it still has VC_CAN_RCV_MSG set. If not, and if
it is still the first on the list, then it hasn't proc_yield()ed yet, and it
will eventually restart when it tries to yield. And this all works without locking the
| proc_lock. There are a bunch more details and races avoided. Check the code |
| out. |
| |
3.3.5: Vcore Business and the VCPD Mboxes
| --------------- |
| There are two types of VCPD mboxes: public and private. Public ones will get |
| handled during preemption recovery. Messages sent here need to be handle-able |
| by any vcore. Private messages are for that specific vcore. In the common |
| case, the public mbox will usually only get looked at by its vcore. Only during |
| recovery and some corner cases will we deal with it remotely. |
| |
Here are some guidelines: if your message is spammy and the handler can deal with
| spurious events and it doesn't need to be on a specific vcore, then go with |
| public. Examples of public mbox events are ones that need to be spammed: |
| preemption recovery, INDIRs, etc. Note that you won't need to worry about |
| these: uthread code and the kernel handle them. But if you have something |
| similar, then that's where it would go. You can also send non-spammy things, |
| but there's no guarantee they'll be looked at. |
| |
| Some messages should only be sent to the private mbox. These include ones that |
make no sense for other vcores to handle. Examples: 2LS IPIs/preemptions (like
"change your scheduling policy, vcore 3"), preemption-pending notifs from the
kernel, timer interrupts, etc.
| |
| An example of something that shouldn't be sent to either is syscall completions. |
| They can't be spammed, so you can't send them around like INDIRs. And they need |
| to be dealt with. Other than carefully-spammed public messages, there's no |
| guarantee of getting a message for certain scenarios (yields). Instead, use an |
| ev_q with INDIR set. |
| |
| Also note that a 2LS could set up a big ev_q with EVENT_IPI and not EVENT_INDIR, |
| and then poll for that in their vcore_entry(). This is equivalent to setting up |
| a small ev_q with EVENT_IPI and pointing it at the private mbox. |
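
For instance (a sketch; 'big_evq' is an ev_queue_big as above, and
handle_mbox() stands in for whatever helper drains a mbox):

    /* Setup: IPI only, no INDIR; the 2LS knows to poll its own mbox */
    big_evq->ev_flags = EVENT_IPI;
    big_evq->ev_vcore = vcore_id();

    void vcore_entry(void)
    {
        handle_mbox(big_evq->ev_mbox);  /* we asked for no INDIRs */
        /* ... rest of the 2LS's entry code ... */
    }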
| |
| 3.4 Application-specific Event Handling |
| --------------------------------------- |
| So what happens when the vcore/2LS isn't handling an event queue, but has been |
| "told" about it? This "telling" is in the form of an IPI. The vcore was |
| prodded, but is not supposed to handle the event. This is actually what |
| happens now in Linux when you send signals for AIO. It's all about who (which |
| thread, in their world) is being interrupted to process the work in an |
| application specific way. The app sets the handler, with the option to have a |
| thread spawned (instead of a sighandler), etc. |
| |
| This is not exactly the same as the case above where the ev_mbox* pointed to |
| the vcore's default mbox. That issue was just about avoiding extra messages |
| (and messages in weird orders). A vcore won't handle an ev_q if the |
| message/contents of the queue aren't meant for the vcore/2LS. For example, a |
| thread can want to run its own handler, perhaps because it performs its own |
| asynchronous I/O (compared to relying on the 2LS to schedule synchronous |
| blocking u_threads). |
| |
| There are a couple ways to handle this. Ultimately, the application is supposed |
| to handle the event. If it asked for an IPI, it is because something ought to |
| be done, which really means running a handler. We used to support the |
| application setting EVENT_THREAD in the ev_q's flags, and the 2LS would spawn a |
| thread to run the ev_q's handler. Now we just have the application block a |
| uthread on the evq. If an ev_handler is set, the vcore will execute the |
handler itself. Careful with this: any memory it touches must be
| pinned, the function must not block (this is only true for the handlers called |
| directly out of vcore context), and it should return quickly. |
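
A sketch of such a handler under those constraints (the message-extraction
helper is hypothetical):

    /* Runs in vcore context: don't block, return quickly, and only touch
     * pinned memory. */
    static void my_evq_handler(struct ev_queue *ev_q)
    {
        struct event_msg msg;

        while (mbox_extract_msg(ev_q->ev_mbox, &msg))   /* hypothetical */
            app_process_event(&msg);    /* app-specific, non-blocking */
    }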
| |
| Note that in either case, vcore-written code (library code) does not look at |
| the contents of the notification event. Also note the handler takes the whole |
| event_queue, and not a specific message. It is more flexible, can handle |
| multiple specific events, and doesn't require the vcore code to dequeue the |
| event and either pass by value or allocate more memory. |
| |
| These ev_q handlers are different than ev_handlers. The former handles an |
| event_queue. The latter is the 2LS's way to handle specific types of messages. |
| If an app wants to process specific messages, have them sent to an ev_q under |
| its control; don't mess with ev_handlers unless you're the 2LS (or example |
| code). |
| |
| Continuing the analogy between vcores getting IPIs and the OS getting HW |
| interrupts, what goes on in vcore context is like what goes on in interrupt |
| context, and the threaded handler is like running a threaded interrupt handler |
| (in Linux). In the ROS world, it is like having the interrupt handler kick |
| off a kernel message to defer the work out of interrupt context. |
| |
| If neither of the application-specific handling flags are set, the vcore will |
| respond to the IPI by attempting to handle the event on its own (lookup table |
| based on the type of event (like "syscall complete")). If you didn't want the |
| vcore to handle it, then you shouldn't have asked for an IPI. Those flags are |
| the means by which the vcore can distinguish between its event_qs and the |
application's. It does not make sense otherwise to send the vcore an IPI and
an event_q, but not give the code the info it needs to handle it.
| |
| In the future, we might have the ability to block a u_thread on an event_q, so |
| we'll have other EV_ flags to express this, and probably a void*. This may |
end up being redundant, since u_threads will be able to block on syscalls (and
| not necessarily IPIs sent to vcores). |
| |
| As a side note, a vcore can turn off the IPI wanted flag at any time. For |
| instance, when it spawns a thread to handle an ev_q, the vcore can turn off |
| IPI wanted on that event_q, and the thread handler can turn it back on when it |
| is done processing and wants to be re-IPId. The reason for this is to avoid |
| taking future IPIs (once we leave vcore context, IPIs are enabled) to let us |
| know about an event for which a handler is already running. |
| |
| 3.5 Overflowed/Missed Messages in the VCPD |
| --------------------------------------- |
| This too is no longer necessary. It's useful in that it shows what we don't |
have to put up with. Missing messages requires potentially painful
infrastructure to handle:
| |
| ----------------------------- |
| All event_q's requesting IPIs ought to register with the 2LS. This is for |
| recovering in case the vcpd's mbox overflowed, and the vcore knows it missed a |
| NE_EVENT type message. At that point, it would have to check all of its |
| IPI-based queues. To do so, it could check to see if the mbox has any |
| messages, though in all likelihood, we'll just act as if there was a message |
| on each of the queues (all such handlers should be able to handle spurious |
IPIs anyways). This is analogous to how the OS's block drivers don't solely
| rely on receiving an interrupt (they deal with it via timeouts). Any user |
| code requiring an IPI must do this. Any code that runs better due to getting |
| the IPI ought to do this. |
| |
| We could imagine having a thread spawned to handle an ev_q, and the vcore |
| never has to touch the ev_q (which might make it easier for memory |
allocation). This isn't a great idea, but I'll still explain it. The
notif_ev message sent to the vcore has the event_q*. We could also send a
| flag with the same info as in the event_q's flags, and also send the handler. |
| The problem with this is that it isn't resilient to failure. If there was a |
message overflow, it would have to check the event_q (which was registered
| before) anyway, and could potentially page fault there. Also the kernel would |
| have faulted on it (and read it in) back when it tried to read those values. |
| It's somewhat moot, since we're going to have an allocator that pins event_qs. |
| ----------------------------- |
| |
| 3.6 Round-Robin or Other IPI-delivery styles |
| --------------------------------------- |
| In the same way that the IOAPIC can deliver interrupts to a group of cores, |
| round-robinning between them, so can we imagine processes wanting to |
| distribute the IPI/active notification of events across its vcores. This is |
only meaningful if the NOTIF_IPI_WANTED flag is set.
| |
| Eventually we'll support this, via a flag in the event_q. When |
NE_ROUND_ROBIN, or whatever, is set, a couple of things will happen. First, the
| vcore field will be used in a "delivery-specific" manner. In the case of RR, |
| it will probably be the most recent destination. Perhaps it will be a bitmask |
| of vcores available to receive. More important is the event_mbox*. If it is |
| set, then the event message will be sent there. Whichever vcore gets selected |
| will receive an IPI, and its vcpd mbox will get a NE_EVENT message. If the |
| event_mbox* is 0, then the actual message will get delivered to the vcore's |
| vcpd mbox (the default location). |
| |
| 3.7 Event_q-less Notifications |
| --------------------------------------- |
Some events need to be delivered directly to the vcore, regardless of any
| event_qs. This happens currently when we bypass the notification table (e.g., |
| sys_self_notify(), preemptions, etc). These notifs will just use the vcore's |
| default mbox. In essence, the ev_q is being generated/sent with the call. |
| The implied/fake ev_q points to the vcpd's mbox, with the given vcore set, and |
| with IPI_WANTED set. It is tempting to make those functions take a |
| dynamically generated ev_q, though more likely we'll just use the lower level |
| functions in the kernel, much like the Round Robin set will need to do. No |
| need to force things to fit just for the sake of using a 'solution'. We want |
| tools to make solutions, not packaged solutions. |
| |
| 3.8 UTHREAD_DONT_MIGRATE |
| --------------------------------------- |
| DONT_MIGRATE exists to allow uthreads to disable notifications/IPIs and enter |
| vcore context. It is needed since you need to read vcoreid to disable notifs, |
| but once you read it, you need to not move to another vcore. Here are a few |
| rules/guidelines. |
| |
We turn on the flag before disabling notifs, and turn the flag back off before
enabling them. The thread won't get migrated in that instant (after clearing
the flag, before enabling notifs) since notifs are still off. But if it was
the other way, we could miss a message (because we skipped an opportunity to be
dropped into vcore context to read a message).
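
As a sketch, the disable side of that sequence looks something like this
(cmb() is a compiler barrier; the helper names are illustrative):

    /* Pin the uthread before looking up the vcoreid */
    current_uthread->flags |= UTHREAD_DONT_MIGRATE;
    cmb();                      /* the flag must be set first */
    disable_notifs(vcore_id()); /* safe: we can't migrate mid-lookup */
    /* notifs are now off, which also prevents migration */
    current_uthread->flags &= ~UTHREAD_DONT_MIGRATE;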
| |
| Don't check messages/handle events when you have a DONT_MIGRATE uthread. There |
| are issues with preemption recovery if you do. In short, if two uthreads are |
| both DONT_MIGRATE with notifs enabled on two different vcores, and one vcore |
| gets preempted while the other gets an IPI telling it to recover the other one, |
| both could keep bouncing back and forth if they handle their preemption |
| *messages* without dealing with their own DONT_MIGRATEs first. Note that the |
| preemption recovery code can handle having a DONT_MIGRATE thread on the vcore. |
| This is a special case, and it is very careful about how cur_uthread works. |
| |
| All uses of DONT_MIGRATE must reenable notifs (and check messages) at some |
| point. One such case is uthread_yield(). Another is mcs_unlock_notifsafe(). |
| Note that mcs_notif_safe locks have uthreads that can't migrate for a |
| potentially long time. notifs are also disabled, so it's not a big deal. It's |
| basically just the same as if you were in vcore context (though technically you |
| aren't) when it comes to preemption recovery: we'll just need to restart the |
| vcore via a syscall. Also note that it would be a real pain in the ass to |
| migrate a notif_safe locking uthread. The whole point of it is in case it grabs |
| a lock that would be held by vcore context, and there's no way to know it isn't |
| a lock on the restart-path. |
| |
| 3.9 Why Preemption Handling Doesn't Lock Up (probably) |
| --------------------------------------- |
| One of the concerns of preemption handling is that we don't get into some form |
| of livelock, where we ping-pong back and forth between vcores (or a set of |
| vcores), all of which are trying to handle each other's preemptions. Part of |
| the concern is that when a vcore sys_changes to another, it can result in |
| another preemption message being sent. We want to be sure that we're making |
| progress, and not just livelocked doing sys_change_vcore()s. |
| |
| A few notes first: |
| 1) If a vcore is holding locks or otherwise isn't handling events and is |
| preempted, it will let go of its locks before it gets to the point of |
| attempting to handle any other vcore preemption events. Event handling is only |
| done when it is okay to never return (meaning no locks are held). If this is |
| the situation, eventually it'll work itself out or get to a potential ping-pong |
| scenario. |
| |
| 2) When you change_to while handling preemption, once you start back up, you |
| will leave change_to and eventually fetch a new event. This means any |
| potential ping-pong needs to happen on a fresh event. |
| |
| 3) If there are enough pcores for the vcores to all run, we won't issue any |
change_tos, since the vcores are no longer preempted. This means we are only
worried about situations with insufficient pcores. We'll mostly talk about 1
| pcore and 2 vcores. |
| |
| 4) Preemption handlers will not call change_to on their target vcore if they |
| are also the one STEALING from that vcore. The handler will stop STEALING |
| first. |
| |
| So the only way to get stuck permanently is if both cores are stuck doing a |
| sys_change_to(FALSE). This means we want to become the other vcore, *and* we |
| need to restart our vcore where it left off. This is due to some invariant |
| that keeps us from abandoning vcore context. If we were to abandon vcore |
| context (with a sys_change_to(TRUE)), we basically don't need to be |
| preempt-recovered. We already packaged up our cur_uthread, and we know we |
| aren't holding any locks or otherwise breaking any invariants. The system will |
| work fine if we never run again. (Someone just needs to check our messages). |
| |
| Now, there are only two cases where we will do a sys_change_to(FALSE) *while* |
| handling preemptions. Again, we aren't concerned about things like MCS-PDR |
| locks; those all work because the change_tos are done where we'd normally just |
| busy loop. We are only concerned about change_tos during handle_vc_preempt. |
| These two cases are when the changing/handling vcore has a DONT_MIGRATE uthread |
| or when someone else is STEALING its uthread. Note that both of these cases |
| are about the calling vcore, not its target. |
| |
| If a vcore (referred to as "us") has a DONT_MIGRATE uthread and it is handling |
| events, it is because someone else is STEALING from our vcore, and we are in |
| the short one-shot event handling loop at the beginning of |
| uthread_vcore_entry(). Whichever vcore is STEALING will quickly realize it |
| can't steal (it sees the DONT_MIGRATE), and bail out. If that vcore isn't |
| running now, we will change_to it (which is the purpose of our handling their |
| preemption). Once that vcore realizes it can't steal, it will stop STEALING |
| and change to us. At this point, no one is STEALING from us, and we move along |
| in the code. Specifically, we do *not* handle events (we now have an event |
| about the other vcore being preempted when it changed_to us), and instead we |
| start up the DONT_MIGRATE uthread and let it run until it is migratable, at |
| which point we handle events and will deal with the other vcore. |
| |
| So DONT_MIGRATE will be sorted out. Likewise, STEALING gets sorted out too, |
| quite easily. If someone is STEALING from us, they will quickly stop STEALING |
| and change to us. There are only two ways this could even happen: they are |
| running concurrently with us, and somehow saw us out of vcore context before |
| deciding to STEAL, or they were in the process of STEALING and got preempted by |
| the kernel. They would not have willingly stopped running while STEALING our |
| cur_uthread. So if we are running and someone is stealing, after a round of |
| change_tos, eventually they run, and stop STEALING. |
| |
| Note that once someone stops STEALING from us, they will not start again, |
| unless we leave vcore context. If that happened, we basically broke out of the |
| ping-pong, and now we're onto another set of preemptions. We wouldn't leave |
| vcore context if we still had preemption events to deal with. |
| |
| Finally, note that we needed to only check for one message at a time at the |
| beginning of uthread_vcore_entry(). If we just handled the entire mbox without |
| checking STEALING, then we might not break out of that loop if there is a |
| constant supply of messages (perhaps from a vcore in a similar loop). |
| |
| Anyway, that's the basic plan behind the preemption handler and how we avoid |
| the ping-ponging. change_to_vcore() is built so that we handle our own |
| preemption before changing (pack up our current uthread), so that we make |
| progress. The two cases where we can't do that get sorted out after everyone |
gets to run once, and since you can't steal or have other uthreads turn on
| DONT_MIGRATE while we're in vcore context, eventually we clear everything up. |
| There might be other bugs or weird corner cases, possibly involving multiple |
| vcores, but I think we're okay for now. |
| |
| 3.10: Handling Messages for Other Vcores |
| --------------------------------------- |
| First, remember that when a vcore handles an event, there's no guarantee that |
| the vcore will return from the handler. It may start fresh in vcore_entry(). |
| |
| The issue is that when you handle another vcore's INDIRs, you may handle |
| preemption messages. If you have to do a change_to, the kernel will make sure |
| a message goes out about your demise. Thus someone who recovers that will |
| check your public mbox. However, the recoverer won't know that you were |
| working on another vcore's mbox, so those messages might never be checked. |
| |
| The way around it is to send yourself a "check the other guy's messages" event. |
| When we might change_to and never return, if we were dealing with another |
vcore's mbox, we'll send ourselves a message to finish up that mbox (if there
| are any messages left). Whoever reads our messages will eventually get that |
| message, and deal with it. |
| |
| One thing that is a little ugly is that the way you deal with messages two |
| layers deep is to send yourself the message. So if VC1 is handling VC2's |
| messages, and then wants to change_to VC3, VC1 sends a message to VC1 to check |
VC2. Later, when VC3 is checking VC1's messages, it'll handle the "check VC2's
messages" message. VC3 can't directly handle VC2's messages, since it could run a
| handler that doesn't return. Nor can we just forget about VC2. So VC3 sends |
| itself a message to check VC2 later. Alternatively, VC3 could send itself a |
| message to continue checking VC1, and then move on to VC2. Both seem |
| equivalent. In either case, we ought to check to make sure the mbox has |
| something before bothering sending the message. |
| |
| So for either a "change_to that might not return" or for a "check INDIRs on yet |
| another vcore", we send messages to ourself so that we or someone else will |
| deal with it. |
| |
| Note that we use TLS to track whether or not we are handling another vcore's |
| messages, and if we do plan to change_to that might not return, we clear the |
bool so that when our vcore starts over at vcore_entry(), it starts fresh and
isn't still checking someone else's messages.
| |
| As a reminder of why this is important: these messages we are hunting down |
| include INDIRs, specifically ones to ev_qs such as the "syscall completed |
| ev_q". If we never get that message, a uthread will block forever. If we |
| accidentally yield a vcore instead of checking that message, we would end up |
| yielding the process forever since that uthread will eventually be the last |
| one, but our main thread is probably blocked on a join call. Our process is |
| blocked on a message that already came, but we just missed it. |
| |
| 4. Single-core Process (SCP) Events: |
| ==================== |
| 4.1 Basics: |
| --------------------------------------- |
| Event delivery is important for SCP's blocking syscalls. It can also be used |
| (in the future) to deliver POSIX signals, which would just be another kernel |
| event. |
| |
| SCPs can receive events just like MCPs. For the most part, the code paths are |
the same on both sides of the K/U interface. The kernel sends events (its event
code can detect an SCP and will send them to vcore0), the kernel will make sure
you can't yield/miss an event, etc. Userspace preps vcore context in advance,
and can do all the things vcore context does: handle events, select a thread to
run.
| For an SCP, there is only one thread to run. |
| |
| 4.2 Degenerate Event Delivery: |
| --------------------------------------- |
| That being said, there are a few tricky things. First, there is a time before |
| the SCP is ready to fully receive events. Specifically, before |
| vcore_event_init(), which is called out of glibc's _start. More importantly, |
| the runtime linker never calls that function, yet it wants to block. |
| |
| The important thing to note is that there are a few parts to event delivery: |
| registration (user), sending the event (kernel), making sure the proc wakes up |
| (kernel), and actually handling the event (user). For syscalls, the only thing |
the process (even rtld) needs is the first three. Registration is easy - it can
be done with nothing more than kernel headers (no need for parlib) for NO_MSG ev_qs
| (no need to init the UCQ). Event handling is trickier, and requires parlib |
| (which rtld can't link against). To support processes that could register for |
| events, but not handle them (or even enter vcore context), the kernel needed a |
| few changes (checking the VC_SCP_NOVCCTX flag) so that it would wake the |
| process, but never put it in vcore context. |
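
A sketch of that bare-bones registration (EVENT_NOMSG is assumed here as the
'no payload' flag; only kernel headers are needed):

    /* Statically allocated ev_q: no mbox/UCQ init, no parlib */
    static struct ev_queue sysc_evq;

    sysc_evq.ev_flags = EVENT_IPI | EVENT_NOMSG;    /* assumed flag names */
    sysc_evq.ev_vcore = 0;      /* SCPs get their events on vcore0 */
    sysc->ev_q = &sysc_evq;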
| |
| This degenerate event handling just wakes the process up, at which point it can |
| check on its syscall. Very early in the process's life, it'll init vcore0's |
| UCQ and be able to handle full events, enter vcore context, etc. |
| |
| Once the SCP is up and running, it can receive events like normal. One thing to |
| note is that the SCPs are not using a handle_syscall() event handler, like the |
| MCPs do. They are only using the event to get the process restarted, at which |
| point their vcore 0 restarts thread0. One consequence of this is that if a |
| process receives an unrelated event while blocking on a syscall, it'll handle |
| that event, then restart thread0. Thread0 will see its syscall isn't complete, |
| and then re-block. (It also re-registers its ev_q, which is harmless). When |
| that syscall is finally done, the kernel will send an event and wake it up |
| again. |
| |
| 4.3 Extra Tidbits: |
| --------------------------------------- |
| If we receive an event right as we transition from SCP to MCP, vcore0 could get |
| spammed with a message that is never received. Right now, it's not a problem, |
| since vcore0 is the first vcore that will get woken up as an MCP. This could be |
| an issue if we ever allow transitions from MCP back to SCP. |
| |
| On a related note, it's now wrong for SCPs to sys_yield(FALSE) (not being nice, |
| meaning they are waiting for an event) in a loop that does not check events or |
| otherwise allow them to break out of that loop. This should be fairly obvious. |
| A little more subtle is that these loops also need to sort out notif_pending. |
| If you are trying to yield and still have an old notif_pending set, the kernel |
| won't let you yield (it thinks you are missing the notif). For the degenerate |
mode (VC_SCP_NOVCCTX is set on vcore0), the kernel will handle dealing with
| this flag. |
| |
| Finally, note that while the SCP is in vcore context, it has none of the |
| guarantees of an MCP. It's somewhat meaningless to talk about being gang |
| scheduled or knowing about the state of other vcores. If you're running, you're |
on a physical core. You may get unexpected interrupts, get descheduled, etc. Aside
| from the guarantees and being the only vcore, the main differences are really up |
| to the kernel scheduler. In that sense, we have somewhat of a new state for |
| processes - SCPs that can enter vcore context. From the user's perspective, |
| they look a lot like an MCP, and the degenerate/early mode SCPs are like the |
| old, dumb SCPs. The big difference for userspace is that there isn't a 2LS yet |
| (will need to reinit things slightly). The kernel treats SCPs and MCPs very |
| differently too, but that may not always be the case. |
| |
| 5. Misc Things That Aren't Sorted Completely: |
| ==================== |
| 5.1 What about short handlers? |
| --------------------------------------- |
| Once we sort the other issues, we can ask for them via a flag in the event_q, |
| and run the handler in the event_q struct. |
| |
| 5.2 What about blocking on a syscall? |
| --------------------------------------- |
| The current plan is to set a flag, and let the kernel go from there. The |
| kernel knows which process it is, since that info is saved in the kthread that |
| blocked. One issue is that the process could muck with that flag and then go |
| to sleep forever. To deal with that, maybe we'd have a long running timer to |
| reap those. Arguably, it's like having a process while(1). You can screw |
| yourself, etc. Killing the process would still work. |