| process-internals.txt | 
 | Barret Rhoden | 
 |  | 
 | This discusses core issues with process design and implementation.  Most of this | 
 | info is available in the source in the comments (but may not be in the future). | 
 | For now, it's a dumping ground for topics that people ought to understand before | 
 | they muck with how processes work. | 
 |  | 
 | Contents: | 
 | 1. Reference Counting | 
 | 2. When Do We Really Leave "Process Context"? | 
 | 3. Leaving the Kernel Stack | 
 | 4. Preemption and Notification Issues | 
 | 5. current_ctx and owning_proc | 
 | 6. Locking! | 
 | 7. TLB Coherency | 
 | 8. Process Management | 
 | 9. On the Ordering of Messages | 
 | 10. TBD | 
 |  | 
 | 1. Reference Counting | 
 | =========================== | 
 | 1.1 Basics: | 
 | --------------------------- | 
 | Reference counts exist to keep a process alive.  We use krefs for this, similar | 
 | to Linux's kref: | 
 | - Can only incref if the current value is greater than 0, meaning there is | 
 |   already a reference to it.  It is a bug to incref something that has no | 
 |   references, so only incref something that you know already has a reference. | 
 |   If you don't know, you need to get it from pid2proc() (which is a careful way | 
 |   of doing this - pid2proc() kref_get_not_zero()s on the reference stored | 
 |   inside it).  If you incref and there are 0 references, the kernel will | 
 |   panic.  Fix your bug / don't incref random pointers.  (See the sketch after | 
 |   this list.) | 
 | - Can always decref. | 
 | - When the decref returns 0, perform some operation.  This does some final | 
 |   cleanup on the object. | 
 | - Process code is trickier since we frequently make references from 'current' | 
 |   (which isn't too bad), but also because we often do not return and need to be | 
 |   careful about the references we passed in to a no-return function. | 
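 |  | 
 | As a concrete illustration of the rules above, here is a minimal sketch of the | 
 | usual pattern.  It assumes helpers named in the conventional style (pid2proc(), | 
 | proc_decref()); treat the exact signatures as illustrative rather than | 
 | authoritative, and do_stuff_with() is purely hypothetical: | 
 |  | 
 |     /* Get a counted reference, use it, then drop it.  If you didn't get | 
 |      * the pointer from pid2proc() (or some other ref source), you may not | 
 |      * incref it out of thin air. */ | 
 |     struct proc *p = pid2proc(pid);     /* increfs for us, or returns 0 */ | 
 |     if (!p) | 
 |         return -ESRCH;                  /* no such process */ | 
 |     do_stuff_with(p);                   /* safe: our ref keeps p alive */ | 
 |     proc_decref(p);                     /* done: drop our reference */ | 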
 |  | 
 | 1.2 Brief History of the Refcnt: | 
 | --------------------------- | 
 | Originally, the refcnt was created to keep page tables from being destroyed (in | 
 | proc_free()) while cores were still using them, which is what happens during | 
 | an ARSC (async remote syscall).  It was then defined to be a count of places in | 
 | the kernel that had an interest in the process staying alive, practically just | 
 | to protect current/cr3.  This 'interest' actually extends to any code holding a | 
 | pointer to the proc, such as one acquired via pid2proc(), which is its current | 
 | meaning. | 
 |  | 
 | 1.3 Quick Aside: The current Macro: | 
 | --------------------------- | 
 | current is a pointer to the proc that is currently loaded/running on any given | 
 | core.  It is stored in the per_cpu_info struct, and set/managed by low-level | 
 | process code.  It is necessary for the kernel to quickly figure out who is | 
 | running on its core, especially when servicing interrupts and traps.  current is | 
 | protected by a refcnt. | 
 |  | 
 | current does not say which process owns / will-run on a core.  The per-cpu | 
 | variable 'owning_proc' covers that.  'owning_proc' should be treated like | 
 | 'current' (aka, 'cur_proc') when it comes to reference counting.  Like all | 
 | refcnts, you can use it, but you can't consume it without atomically either | 
 | upping the refcnt or passing the reference (clearing the variable storing the | 
 | reference).  Don't pass it to a function that will consume it and not return | 
 | without upping it. | 
 |  | 
 | 1.4 Reference Counting Rules: | 
 | --------------------------- | 
 | +1 for existing. | 
 | - The fact that the process is supposed to exist is worth +1.  When it is time | 
 |   to die, we decref, and it will eventually be cleaned up.  This existence is | 
 |   explicitly kref_put()d in proc_destroy(). | 
 | - The hash table is a bit tricky.  We need to kref_get_not_zero() when it is | 
 |   locked, so we know we aren't racing with proc_free freeing the proc and | 
 |   removing it from the list.  After removing it from the hash, we don't need to | 
 |   kref_put it, since it was an internal ref.  The kref (i.e. external) isn't for | 
 |   being on the hash list, it's for existing.  This separation allows us to | 
 |   remove the proc from the hash list in the "release" function.  See kref.txt | 
 |   for more details. | 
 |  | 
 | +1 for someone using it or planning to use it. | 
 | - This includes simply having a pointer to the proc, since presumably you will | 
 |   use it.  pid2proc() will incref for you.  When you are done, decref. | 
 | - Functions that create a process and return a pointer (like proc_create() or | 
 |   kfs_proc_create()) will also up the refcnt for you.  Decref when you're done. | 
 | - If the *proc is stored somewhere where it will be used again, such as in an IO | 
 |   continuation, it needs to be refcnt'd.  Note that if you already had a | 
 |   reference from pid2proc(), simply don't decref after you store the pointer. | 
 |  | 
 | +1 for current. | 
 | - current counts as someone using it (expressing interest in the core), but is | 
 |   also a source of the pointer, so it's a bit different.  Note that all krefs | 
 |   are sources of a pointer.  When we are running on a core that has current | 
 |   loaded, the ref is both for its usage as well as for being the current | 
 |   process. | 
 | - You have a reference from current and can use it without refcnting, but | 
 |   anything that needs to eat a reference or store/use it needs an incref first. | 
 |   To be clear, your reference is *NOT* edible.  It protects the cr3, guarantees | 
 |   the process won't die, and serves as a bootstrappable reference. | 
 | - Specifically, if you get a ref from current, but then save it somewhere (like | 
 |   an IO continuation request), then clearly you must incref, since it's both | 
 |   current and stored/used. | 
 | - If you don't know what might be downstream from your function, then incref | 
 |   before passing the reference, and decref when it returns.  We used to do this | 
 |   for all syscalls, but now only do it for calls that might not return and | 
 |   expect to receive reference (like proc_yield). | 
 |  | 
 | All functions that take a *proc have a refcnt'd reference, though it may not be | 
 | edible (it could be current).  It is the caller's responsibility to make sure | 
 | it's edible if that is necessary.  It is the callee's responsibility to incref | 
 | if it stores or makes a copy of the reference. | 
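 |  | 
 | For example, storing a proc pointer for later use (say, in an IO continuation) | 
 | needs its own reference.  A hedged sketch - struct io_cont, its fields, and the | 
 | enqueue helper are invented purely for illustration: | 
 |  | 
 |     /* We were handed 'p' (refcounted, but possibly current / not edible). | 
 |      * The continuation outlives this call, so it gets its own ref. */ | 
 |     static void save_io_continuation(struct proc *p, struct io_cont *cont) | 
 |     { | 
 |         proc_incref(p, 1);      /* the stored copy gets its own count */ | 
 |         cont->proc = p; | 
 |         enqueue_io_cont(cont);  /* whoever completes it will decref */ | 
 |     } | 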
 |  | 
 | 1.5 Functions That Don't or Might Not Return: | 
 | --------------------------- | 
 | Refcnting and especially decreffing gets tricky when there are functions that | 
 | MAY not return.  proc_restartcore() does not return (it pops into userspace). | 
 | proc_run() used to not return, if the core it was called on would pop into | 
 | userspace (if it was a _S, or if the core is part of the vcoremap for a _M). | 
 | This doesn't happen anymore, since we have cur_ctx in the per-cpu info. | 
 |  | 
 | Functions that MAY not return will "eat" your reference *IF* they do not return. | 
 | This means that you must have a reference when you call them (like always), and | 
 | that reference will be consumed / decref'd for you if the function doesn't | 
 | return.  Or something similarly appropriate. | 
 |  | 
 | Arguably, for functions that MAY not return, but will always be called with | 
 | current's reference (proc_yield()), we could get away without giving it an | 
 | edible reference, and then never eating the ref.  Yield needs to be reworked | 
 |   anyway, so it's not a big deal yet. | 
 |  | 
 | We do this because when the function does not return, you will not have the | 
 | chance to decref (your decref code will never run).  We need the reference when | 
 | going in to keep the object alive (like with any other refcnt).  We can't have | 
 | the function always eat the reference, since you cannot simply re-incref the | 
 | pointer (not allowed to incref unless you know you had a good reference).  You'd | 
 | have to do something like p = pid2proc(p_pid);  It's clunky to do that, easy to | 
 | screw up, and semantically, if the function returns, then we may still have an | 
 | interest in p and should decref later. | 
 |  | 
 | The downside is that functions need to determine if they will return or not, | 
 | which can be a pain (for an out-of-date example: a linear time search when | 
 | running an _M, for instance, which can suck if we are trying to use a | 
 | broadcast/logical IPI). | 
 |  | 
 | As the caller, you usually won't know if the function will return or not, so you | 
 | need to provide a consumable reference.  Current doesn't count.  For example, | 
 | proc_run() requires a reference.  You can proc_run(p), and use p afterwards, and | 
 | later decref.  You need to make sure you have a reference, so things like | 
 | proc_run(pid2proc(55)) work, since pid2proc() increfs for you.  But you cannot | 
 | proc_run(current), unless you incref current in advance.  Incidentally, | 
 | proc_running current doesn't make a lot of sense. | 
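 |  | 
 | A short sketch of those rules (names as used in this document; error handling | 
 | elided, and the exact proc_incref() signature is an assumption): | 
 |  | 
 |     /* Fine: pid2proc() hands us an edible reference.  If proc_run() were | 
 |      * a may-not-return function and didn't return, it would eat that ref; | 
 |      * since it returns, we still hold the ref and decref later. */ | 
 |     struct proc *p = pid2proc(55); | 
 |     if (p) { | 
 |         proc_run(p); | 
 |         /* ... can keep using p here ... */ | 
 |         proc_decref(p); | 
 |     } | 
 |  | 
 |     /* Not fine without help: current is not edible. */ | 
 |     proc_incref(current, 1);    /* make an edible ref first */ | 
 |     proc_run(current);          /* (rarely makes sense, as noted) */ | 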
 |  | 
 | 1.6 Runnable List: | 
 | --------------------------- | 
 | Procs on the runnable list need to have a refcnt (other than the +1 for | 
 | existing).  It's something that cares that the process exists.  We could have | 
 | had it implicitly be refcnt'd (the fact that it's on the list is enough, sort of | 
 | as if it was part of the +1 for existing), but that complicates things.  For | 
 | instance, it is a source of a reference (for the scheduler) and you could not | 
 | proc_run() a process from the runnable list without worrying about increfing it | 
 | before hand.  This isn't true anymore, but the runnable lists are getting | 
 | overhauled anyway.  We'll see what works nicely. | 
 |  | 
 | 1.7 Internal Details for Specific Functions: | 
 | --------------------------- | 
 | proc_run()/__proc_give_cores(): makes sure enough refcnts are in place for all | 
 | places that will install owning_proc.  This also makes it easier on the system | 
 | (one big incref(n), instead of n increfs of (1) from multiple cores).  | 
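 |  | 
 | Roughly, the sender does something like the following sketch (the kmsg call, | 
 | the message type, and a count-taking proc_incref() are assumptions about the | 
 | API, and grant_cores() is hypothetical): | 
 |  | 
 |     /* One bulk incref by the sender instead of n increfs from n cores. */ | 
 |     static void grant_cores(struct proc *p, uint32_t *pcores, int num) | 
 |     { | 
 |         proc_incref(p, num);    /* one cacheline hit, not num of them */ | 
 |         for (int i = 0; i < num; i++) | 
 |             send_kernel_message(pcores[i], __startcore, (long)p, 0, 0, | 
 |                                 KMSG_ROUTINE);  /* installs owning_proc */ | 
 |     } | 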
 |  | 
 | __set_proc_current() is a helper that makes sure p is the cur_proc.  It will | 
 | incref if installing a new reference to p.  If it removed an old proc, it will | 
 | decref. | 
 |  | 
 | __proc_startcore(): assumes all references to p are sorted.  It will not | 
 | return, and you should not pass it a reference you need to decref().  Passing | 
 | it 'owning_proc' works, since you don't want to decref owning_proc. | 
 |  | 
 | proc_destroy(): it used to not return, and back then if your reference was | 
 | from 'current', you needed to incref.  Now that proc_destroy() returns, it | 
 | isn't a big deal.  Just keep in mind that if you have a function that doesn't | 
 | return, there's no way for the function to know if its passed reference is | 
 | edible.  Even if p == current, proc_destroy() can't tell if you sent it p (and | 
 | had a reference) or current and didn't. | 
 |  | 
 | proc_yield(): when this doesn't return, it eats your reference.  It will also | 
 | decref twice.  Once when it clears owning_proc, and again when it calls | 
 | abandon_core() (which clears cur_proc). | 
 |  | 
 | abandon_core(): it was not given a reference, so it doesn't eat one.  It will | 
 | decref when it unloads the cr3.  Note that this is a potential performance | 
 | issue.  When preempting or killing, there are n cores that are fighting for the | 
 | cacheline to decref.  An alternative would be to have one core decref for all n | 
 | cores, after it knows all cores unloaded the cr3.  This would be a good use of | 
 | the checklist (possibly with one cacheline per core).  It would take a large | 
 | amount of memory for better scalability. | 
 |  | 
 | 1.8 Things I Could Have Done But Didn't And Why: | 
 | --------------------------- | 
 | Q: Could we have the first reference (existence) mean it could be on the runnable | 
 | list or otherwise in the proc system (but not other subsystems)?  In this case, | 
 | proc_run() won't need to eat a reference at all - it will just incref for every | 
 | current it will set up. | 
 |  | 
 | New A: Maybe, now that proc_run() returns. | 
 |  | 
 | Old A: No: if you pid2proc(), then proc_run() but never return, you have (and | 
 | lose) an extra reference.  We need proc_run() to eat the reference when it | 
 | does not return.  If you decref between pid2proc() and proc_run(), there's a | 
 | (rare) race where the refcnt hits 0 by someone else trying to kill it.  While | 
 | proc_run() will check to see if someone else is trying to kill it, there's a | 
 | slight chance that the struct will be reused and recreated.  It'll probably | 
 | never happen, but it could, and out of principle we shouldn't be referencing | 
 | memory after it's been deallocated.  Avoiding races like this is one of the | 
 | reasons for our refcnt discipline. | 
 |  | 
 | Q: (Moot) Could proc_run() always eat your reference, which would make it | 
 | easier for its implementation? | 
 |  | 
 | A: Yeah, technically, but it'd be a pain, as mentioned above.  You'd need to | 
 | reacquire a reference via pid2proc(), which is rather easy to mess up. | 
 |  | 
 | Q: (Moot) Could we have made proc_destroy() take a flag, saying whether or not | 
 | it was called on current and needed a decref instead of wasting an incref? | 
 |  | 
 | A: We could, but won't.  This is one case where the external caller is the one | 
 | that knows the function needs to decref or not.  But it breaks the convention a | 
 | bit, doesn't mirror proc_create() as well, and we need to pull in the cacheline | 
 | with the refcnt anyways.  So for now, no. | 
 |  | 
 | Q: (Moot) Could we make __proc_give_cores() simply not return if an IPI is | 
 | coming? | 
 |  | 
 | A: I did this originally, and manually unlocked and __wait_for_ipi()d.  Though | 
 | we'd then need to deal with it like that for all of the related functions, which | 
 | doesn't work if you wanted to do something afterwards (like schedule(p)).  Also | 
 | these functions are meant to be internal helpers, so returning the bool makes | 
 | more sense.  It eventually led to having __proc_unlock_ipi_pending(), which made | 
 | proc_destroy() much cleaner and helped with a general model of dealing with | 
 | these issues.  Win-win. | 
 |  | 
 | 2. When Do We Really Leave "Process Context"? | 
 | =========================== | 
 | 2.1 Overview | 
 | --------------------------- | 
 | First off, it's not really "process context" in the way Linux deals with it.  We | 
 | aren't operating in kernel mode on behalf of the process (always).  We are | 
 | specifically talking about when a process's cr3 is loaded on a core.  Usually, | 
 | current is also set (the exception for now is when processing ARSCs). | 
 |  | 
 | There are a couple different ways to do this.  One is to never unload a context | 
 | until something new is being run there (handled solely in __proc_startcore()). | 
 | Another way is to always explicitly leave the core, like by abandon_core()ing. | 
 |  | 
 | The issue with the former is that you could have contexts sitting around for a | 
 | while, and also would have a bit of extra latency when __proc_free()ing during | 
 | someone *else's* __proc_startcore() (though that could be avoided if it becomes | 
 | a real issue, via some form of reaping).  You'll also probably have excessive | 
 | decrefs (based on the interactions between proc_run() and __startcore()). | 
 |  | 
 | The issue with the latter is excessive TLB shootdowns and corner cases.  There | 
 | could be some weird cases (in core_request() for example) where the core you are | 
 | running on has the context loaded for proc A on a mgmt core, but decides to give | 
 | it to proc B. | 
 |  | 
 | If no process is running there, current == 0 and boot_cr3 is loaded, meaning no | 
 | process's context is loaded. | 
 |  | 
 | All changes to cur_proc, owning_proc, and cur_ctx need to be done with | 
 | interrupts disabled, since they change in interrupt handlers. | 
 |  | 
 | 2.2 Here's how it is done now: | 
 | --------------------------- | 
 | All code is capable of 'spamming' cur_proc (with interrupts disabled!).  If it | 
 | is 0, feel free to set it to whatever process you want.  All code that | 
 | requires current to be set will do so (like __proc_startcore()).  The | 
 | smp_idle() path will make sure current is clear when it halts.  So long as you | 
 | don't change other concurrent code's expectations, you're okay.  What I mean | 
 | by that is you don't clear cur_proc while in an interrupt handler.  But if it | 
 | is already 0, __startcore is allowed to set it to its future proc (which is | 
 | an optimization).  Other code didn't have any expectations of it (it was 0). | 
 | Likewise, kthread code when we sleep_on() doesn't have to keep cur_proc set. | 
 | A kthread is somewhat an isolated block (codewise), and leaving current set | 
 | when it is done is solely to avoid a TLB flush (at the cost of an incref). | 
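 |  | 
 | A minimal sketch of both rules (interrupts disabled around the change, and only | 
 | 'spamming' cur_proc when it is 0).  The pcpui field names and the irqsave | 
 | helpers are written in the conventional style and are illustrative: | 
 |  | 
 |     struct per_cpu_info *pcpui = &per_cpu_info[core_id()]; | 
 |     int8_t irq_state = 0; | 
 |  | 
 |     disable_irqsave(&irq_state);    /* kmsg/IRQ handlers also touch these */ | 
 |     if (!pcpui->cur_proc) { | 
 |         proc_incref(p, 1);          /* cur_proc holds a reference */ | 
 |         pcpui->cur_proc = p;        /* it was 0: free to 'spam' it */ | 
 |     } | 
 |     enable_irqsave(&irq_state); | 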
 |  | 
 | In general, we try to proactively leave process context, but have the ability | 
 | to stay in context til __proc_startcore() to handle the corner cases (and to | 
 | maybe cut down the TLB flushes later).  To stop proactively leaving, just | 
 | change abandon_core() to not do anything with current/cr3.  You'll see weird | 
 | things like processes that won't die until their old cores are reused.  The | 
 | reason we proactively leave context is to help with sanity for these issues, | 
 | and also to avoid decref's in __startcore(). | 
 |  | 
 | A couple other details: __startcore() sorts the extra increfs, and | 
 | __proc_startcore() sorts leaving the old context.  Anytime a __startcore kernel | 
 | message is sent, the sender increfs in advance for the owning_proc refcnt.  As | 
 | an optimization, we can also incref to *attempt* to set current.  If current | 
 | was 0, we set it.  If it was already something else, we failed and need to | 
 | decref.  __proc_startcore(), which is the last moment before we *must* have the | 
 | cr3/current issues sorted, does the actual check if there was an old process | 
 | there or not, while it handles the lcr3 (if necessary).  In general, lcr3's | 
 | ought to have refcnts near them, or else comments explaining why not. | 
 |  | 
 | So we leave process context when told to do so (__death/abandon_core()) or if | 
 | another process is run there.  The _M code is such that a proc will stay on its | 
 | core until it receives a message, and that message would cleanup/restore a | 
 | generic context (boot_cr3).  A _S could stay on its core until another _S came | 
 | in.  This is much simpler for cases when a timer interrupt goes off to force a | 
 | schedule() decision.  It also avoids a TLB flush in case the scheduler picked | 
 | that same proc to run again.  This could also happen to an _M, if for some | 
 | reason it was given a management core (!!!) or some other event happened that | 
 | caused some management/scheduling function to run on one of its cores (perhaps | 
 | it asked). | 
 |  | 
 | proc_yield() abandons the core / leaves context. | 
 |  | 
 | 2.3 Other issues: | 
 | --------------------------- | 
 | Note that dealing with interrupting processes that are in the kernel is tricky. | 
 | There is no true process context, so we can't leave a core until the kernel is | 
 | in a "safe place", i.e. it's state is bundled enough that it can be recontinued | 
 | later.  Calls of this type are routine kernel messages, executed at a convenient | 
 | time (specifically, before we return to userspace in proc_restartcore()). | 
 |  | 
 | This same thing applies to __death messages.  Even though a process is dying, it | 
 | doesn't mean we can just drop whatever the kernel was doing on its behalf.  For | 
 | instance, it might be holding a reference that will never get decreffed if its | 
 | stack gets dropped. | 
 |  | 
 | 3. Leaving the Kernel Stack: | 
 | =========================== | 
 | Just because a message comes in saying to kill a process, it does not mean we | 
 | should immediately abandon_core().  The problem is more obvious when there is | 
 | a preempt message, instead of a death message, but either way there is state | 
 | that needs to be cleaned up (refcnts that need to be dropped, etc). | 
 |  | 
 | The solution to this is rather simple: don't abandon right away.  That was | 
 | always somewhat the plan for preemption, but was never done for death.  And | 
 | there are several other cases to worry about too.  To enforce this, we expand | 
 | the old "active messages" into a generic work execution message (a kernel | 
 | message) that can be delayed or shipped to another core.  These types of | 
 | messages will not be executed immediately on the receiving pcore - instead they | 
 | are on the queue for "when there's nothing else to do in the kernel", which is | 
 | checked in smp_idle() and before returning to userspace in proc_restartcore(). | 
 | Additionally, these kernel messages can also be queued on an alarm queue, | 
 | delaying their activation as part of a generic kernel alarm facility. | 
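 |  | 
 | For example, queueing work as a routine kernel message looks roughly like this | 
 | sketch (the handler prototype, send_kernel_message() signature, and | 
 | KMSG_ROUTINE flag are from memory and should be treated as illustrative): | 
 |  | 
 |     /* Runs on the target pcore "when there's nothing else to do in the | 
 |      * kernel": from smp_idle() or on the way out in proc_restartcore(). */ | 
 |     static void __death(uint32_t srcid, long a0, long a1, long a2) | 
 |     { | 
 |         /* drop refs, unmap the vcore, abandon_core(), etc. */ | 
 |     } | 
 |  | 
 |     send_kernel_message(pcoreid, __death, 0, 0, 0, KMSG_ROUTINE); | 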
 |  | 
 | One subtlety is that __proc_startcore() shouldn't check for messages, since it | 
 | is called by __startcore (a message).  Checking there would run the messages out | 
 | of order, which is exactly what we are trying to avoid (total chaos).  No one | 
 | should call __proc_startcore, other than proc_restartcore() or __startcore(). | 
 | If we ever have functions that do so, if they are not called from a message, | 
 | they must check for outstanding messages. | 
 |  | 
 | This last subtlety is why we needed to change proc_run()'s _S case to use a | 
 | local message instead of calling __proc_startcore() (and why no one should ever | 
 | call __proc_startcore() directly).  Otherwise we could unlock, thereby freeing | 
 | another core to change the proc state and send a message to us, then call | 
 | __proc_startcore(), and then read the message before we had installed current | 
 | or had a userspace TF to preempt, among probably a few other problems. | 
 | Treating _S as a local message is cleaner, begs to be merged in the code with | 
 | _M's code, and uses the messaging infrastructure to avoid all the races that | 
 | it was created to handle. | 
 |  | 
 | Incidentally, we don't need to worry about missing messages while trying to pop | 
 | back to userspace from __proc_startcore, since an IPI will be on the way | 
 | (possibly a self-ipi caused by the __kernel_message() handler).  This is also | 
 | why we needed to make process_routine_kmsg() keep interrupts disabled when it | 
 | stops (there's a race between checking the queue and disabling ints). | 
 |  | 
 | 4. Preemption and Notification Issues: | 
 | =========================== | 
 | 4.1: Message Ordering and Local Calls: | 
 | --------------------------- | 
 | Since we go with the model of cores being told what to do, there are issues | 
 | with messages being received in the wrong order.  That is why we have the | 
 | kernel messages (guaranteed, in-order delivery), with the proc-lock protecting | 
 | the send order.  However, this is not enough for some rare races. | 
 |  | 
 | Local calls can also perform the same tasks as messages (calling | 
 | proc_destroy() while a death IPI is on its way).  We refer to these calls as | 
 | messing with "local fate", as opposed to global state (we're clever); | 
 | preempting a single vcore doesn't change the process's state.  These calls | 
 | are a little different, because they also involve a check to see if the caller | 
 | should perform the function or some other action (e.g., on death, just idling | 
 | and waiting for an IPI instead of trying to kill itself), instead of just | 
 | blindly doing something. | 
 |  | 
 | 4.1.1: Possible Solutions | 
 | ---------------- | 
 | There are two ways to deal with this.  One (and the better one, I think) is to | 
 | check state, and determine if it should proceed or abort.  This requires that | 
 | all local-fate-dependent calls always have enough state to do their job.  In | 
 | the past, this meant that any function that results in sending a directive to | 
 | a vcore stores enough info in the proc struct that a local call can determine | 
 | if it should take action or abort.  In the past, we used the vcore/pcoremap as | 
 | a way to send info to the receiver about what vcore they are (or should be). | 
 | Now, we store that info in pcpui (for '__startcore', we send it as a | 
 | parameter).  Either way, the general idea is still true: local calls can | 
 | proceed when they are called, and are not self-ipi'd to a nebulous later time. | 
 |  | 
 | The other way is to send the work (including the checks) in a self-ipi kernel | 
 | message.  This will guarantee that the message is executed after any existing | 
 | messages (making the k_msg queue the authority for what should happen to a | 
 | core).  The check is also performed later (when the k_msg executes).  There | 
 | are a couple issues with this: if we allow the local core to send itself a | 
 | k_msg that could be out of order (meaning it should not be sent, and is only | 
 | sent due to ignorance of its sealed fate), AND if we return the core to the | 
 | idle-core-list once its fate is sealed, we need to detect that the message is | 
 | for the wrong process and that the process is in the wrong state.  To do this, | 
 | we probably need local versioning on the pcore so it can detect that the | 
 | message is late/wrong.  We might get by with just the proc* (though that is | 
 | tricky with death and proc reuse), so long as we don't allow new startcores | 
 | for a proc until AFTER the preemption is completed. | 
 |  | 
 | 4.2: Preempt-Served Flag | 
 | ---------------- | 
 | We want to be able to consider a pcore free once its owning proc has dealt | 
 | with removing it.  This allows a scheduler-like function to easily take a core | 
 | and then give it to someone else, without waiting for each vcore to respond, | 
 | saying that the pcore is free/idle. | 
 |  | 
 | We used to not unmap until we were in '__preempt' or '__death', and we needed | 
 | a flag to tell yield-like calls that a message was already on the way and to | 
 | not rely on the vcoremap.  This is pretty fucked up for a number of reasons, | 
 | so we changed that.  But we still wanted to know when a preempt was in | 
 | progress so that the kernel could avoid giving out the vcore until the preempt | 
 | was complete. | 
 |  | 
 | Here's the scenario: we send a '__startcore' to core 3 for VC5->PC3.  Then we | 
 | quickly send a '__preempt' to 3, and then a '__startcore' to core 4 (a | 
 | different pcore) for VC5->PC4.  Imagine all of this happens before the first | 
 | '__startcore' gets processed (IRQ delay, fast ksched, whatever).  We need to | 
 | not run the second '__startcore' on pcore 4 before the preemption has saved | 
 | all of the state of the VC5.  So we spin on preempt_served (which may get | 
 | renamed to preempt_in_progress).  We need to do this in the sender, and not | 
 | the receiver (not in the kmsg), because the kmsgs can't tell which one they | 
 | are.  Specifically, the first '__startcore' on core 3 runs the same code as | 
 | the '__startcore' on core 4, working on the same vcore.  Anything we tell VC5 | 
 | will be seen by both PC3 and PC4.  We'd end up deadlocking on PC3 while it | 
 | spins waiting for the preempt message that also needs to run on PC3. | 
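 |  | 
 | In code, the sender's side of this is roughly the following sketch (field and | 
 | helper names are approximate): | 
 |  | 
 |     /* Before (re)granting VC5 to another pcore, wait for the outstanding | 
 |      * '__preempt' to finish saving VC5's state on the old pcore. */ | 
 |     while (vc->preempt_served)      /* a.k.a. preempt_in_progress */ | 
 |         cpu_relax();                /* cleared by '__preempt' when done */ | 
 |     /* now safe to send '__startcore' for VC5 -> PC4 */ | 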
 |  | 
 | The preempt_pending flag is actually a timestamp, with the expiration time of | 
 | the core at which the message will be sent.  We could try to use that, but | 
 | since alarms aren't fired at exactly the time they are scheduled, the message | 
 | might not actually be sent yet (though it will, really soon).  Still, we'll | 
 | just go with the preempt-served flag for now. | 
 |  | 
 | 4.3: Impending Notifications | 
 | ---------------- | 
 | It's also possible that there is an impending notification.  There's no change | 
 | in fate (though there could be a fate-changing preempt on its way), just the | 
 | user wants a notification handler to run.  We need a flag anyways for this | 
 | (discussed below), so proc_yield() or whatever other local call we have can | 
 | check this flag as well.   | 
 |  | 
 | Though for proc_yield(), it doesn't care if a notification is on its way (can | 
 | be dependent on a flag to yield from userspace, based on the nature of the | 
 | yield (which still needs to be sorted)).  If the yield is in response to a | 
 | preempt_pending, it actually should yield and not receive the notification. | 
 | So it should destroy its vcoreid->pcoreid mapping and abandon_core().  When | 
 | that notification hits, it will be for a proc that isn't current, and will be | 
 | ignored (it will get run the next time that vcore fires up, handled below). | 
 |  | 
 | There is a slight chance that the same proc will run on that pcore, but with a | 
 | different vcoreid.  In the off chance this happens, the new vcore will get a | 
 | spurious notification.  Userspace needs to be able to handle spurious | 
 | notifications anyways, (there are a couple other cases, and in general it's | 
 | not hard to do), so this is not a problem.  Instead of trying to have the | 
 | kernel ignore the notification, we just send a spurious one.  A crappy | 
 | alternative would be to send the vcoreid with the notification, but that would | 
 | mean we can't send a generic message (broadcast) to a bunch of cores, which | 
 | will probably be a problem later. | 
 |  | 
 | Note that this specific case is because the "local work message" gets | 
 | processed out of order with respect to the notification.  And we want this in | 
 | that case, since that proc_yield() is more important than the notification. | 
 |  | 
 | 4.4: Preemption / Allocation Phases and Alarm Delays | 
 | --------------------------- | 
 | A per-vcore preemption phase starts when the kernel marks the core's | 
 | preempt_pending flag/counter and can include the time when an alarm is | 
 | waiting to go off to reclaim the core.  The phase ends when the vcore's pcore | 
 | is reclaimed, either as a result of the kernel taking control, or because a | 
 | process voluntarily yielded. | 
 |  | 
 | Specifically, the preempt_pending variable is actually a timestamp for when | 
 | the core will be revoked (this assumes some form of global time, which we need | 
 | anyways).  If its value is 0, then there is no preempt-pending, it is not in a | 
 | phase, and the vcore can be given out again.  | 
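 |  | 
 | So the "is this vcore in a preempt phase" check is just a test of that | 
 | timestamp, as in this sketch (field name approximate): | 
 |  | 
 |     static bool vcore_in_preempt_phase(struct vcore *vc) | 
 |     { | 
 |         /* 0 means no preempt-pending; nonzero is the revocation deadline | 
 |          * in whatever global time base we use. */ | 
 |         return vc->preempt_pending != 0; | 
 |     } | 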
 |  | 
 | When a preempt alarm goes off, the alarm only means to check a process for | 
 | expired vcores.  If the vcore has been yielded while the alarm was pending, | 
 | the preempt_pending flag will be reset to 0.  To speed up the search for | 
 | vcores to preempt, there's a circular buffer corelist in the proc struct, with | 
 | vcoreids of potential suspects.  Or at least this will exist at some point. | 
 | Also note that the preemption list isn't bound to a specific alarm: you can | 
 | check the list at any time (not necessarily on a specific alarm), and you can | 
 | have spurious alarms (the list is empty, so it'll be a noop). | 
 |  | 
 | Likewise, a global preemption phase is when an entire MCP is getting | 
 | gang_preempted, and the global deadline is set.  A function can quickly check | 
 | to see if the process responded, since the list of vcores with preemptions | 
 | pending will be empty. | 
 |  | 
 | It seems obvious, but we do not allow allocation of a vcore during its | 
 | preemption phase.  The main reason is that it can potentially break | 
 | assumptions about the vcore->pcore mapping and can result in multiple | 
 | instances of the same vcore on different pcores.  Imagine a preempt message | 
 | sent to a pcore (after the alarm goes off), meanwhile that vcore/pcore yields | 
 | and the vcore reactivates somewhere else.  There is a potential race on the | 
 | vcore_ctx state: the new vcore is reading while the old is writing.  This | 
 | issue is sorted naturally: the vcore entry in the vcoremap isn't cleared until | 
 | the vcore/pcore is actually yielded/taken away, so the code looking for a free | 
 | vcoreid slot will not try to use it. | 
 |  | 
 | Note that if we didn't design the alarm system to simply check for | 
 | preemptions (perhaps it has a stored list of vcores to preempt), then we | 
 | couldn't end the preempt-phase until the alarm was sorted.  If that is the | 
 | case, we could easily give out a vcore that had been yielded but was still in | 
 | a preempt-phase.  Stopping an alarm would be tricky too, since there could be | 
 | lots of vcores in different states that need to be sorted by the alarm (so | 
 | ripping it out isn't enough).  Setting a flag might not be enough either. | 
 | Vcore version numbers/history (as well as global proc histories) are a pain I'd | 
 | like to avoid too.  So don't change the alarm / delayed preemption system | 
 | without thinking about this. | 
 |  | 
 | Also, allowing a vcore to restart while preemptions are pending also mucks | 
 | with keeping the vcore mapping "old" (while the message is in flight).  A | 
 | pcore will want to use that to determine which vcore is running on it.  It | 
 | would be possible to keep a pcoremap for the reverse mapping out of sync, but | 
 | that seems like a bad idea.  In general, having the pcoremap is a good idea | 
 | (whenever we talk about a vcoremap, we're usually talking about both | 
 | directions: "the vcore->pcore mapping"). | 
 |  | 
 | 4.5: Global Preemption Flags | 
 | --------------------------- | 
 | If we are trying to preempt an entire process at the same time, instead of | 
 | playing with the circular buffer of vcores pending preemption, we could have a | 
 | global timer as well.  This avoids some O(n) operations, though it means that | 
 | userspace needs to check two "flags" (expiration dates) when grabbing its | 
 | preempt-critical locks. | 
 |  | 
 | 4.6: Notifications Mixed with Preemption and Sleeping | 
 | --------------------------- | 
 | It is possible that notifications will mix with preemptions or come while a | 
 | process is not running.  Ultimately, the process wants to be notified on a | 
 | given vcore.  Whenever we send an active notification, we set a flag in procdata | 
 | (notif_pending).  If the vcore is offline, we don't bother sending the IPI/notif | 
 | message.  The kernel will make sure it runs the notification handler (as well as | 
 | restoring the vcore_ctx) the next time that vcore is restarted.  Note that | 
 | userspace can toggle this, so it can handle the notifications from a different | 
 | core if it likes, or it can independently send a notification. | 
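 |  | 
 | The send side is roughly the following sketch (the procdata layout, wmb(), and | 
 | the IPI helper are named in the conventional style and are illustrative): | 
 |  | 
 |     static void send_notif(struct proc *p, uint32_t vcoreid) | 
 |     { | 
 |         struct preempt_data *vcpd = | 
 |                 &p->procdata->vcore_preempt_data[vcoreid]; | 
 |  | 
 |         vcpd->notif_pending = TRUE; /* "this vcore wants an IPI" */ | 
 |         wmb();                      /* flag is visible before the map check */ | 
 |         if (vcore_is_mapped(p, vcoreid)) | 
 |             spam_notif_ipi(p, vcoreid);  /* hypothetical IPI/kmsg helper */ | 
 |         /* if offline: the flag is caught when the vcore restarts */ | 
 |     } | 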
 |  | 
 | Note we use notif_pending to detect if an IPI was missed while notifs were | 
 | disabled (this is done in pop_user_ctx() by userspace).  The overall meaning | 
 | of notif_pending is that a vcore wants to be IPI'd.  The IPI could be | 
 | in-flight, or it could be missed.  Since notification IPIs can be spurious, | 
 | when we have potential races, we err on the side of sending.  This happens | 
 | when pop_user_ctx() notifies itself, and when the kernel makes sure to start a | 
 | vcore in vcore context if a notif was pending.  This was simplified a bit over | 
 | the years by having uthreads always get saved into the uthread_ctx (formerly | 
 | the notif_tf), instead of in the old preempt_tf (which is now the vcore_ctx). | 
 |  | 
 | If a vcore has a preempt_pending, we will still send the active notification | 
 | (IPI).  The core ought to get a notification for the preemption anyway, so we | 
 | need to be able to send one.  Additionally, once the vcore is handling that | 
 | preemption notification, it will have notifs disabled, which will prevent us | 
 | from sending any extra notifications anyways. | 
 |   | 
 | 4.7: Notifs While a Preempt Message is Served | 
 | --------------------------- | 
 | It is possible to have the kernel handling a notification k_msg and to have a | 
 | preempt k_msg in the queue (preempt-served flag is set).  Ultimately, what we | 
 | want is for the core to be preempted and the notification handler to run on | 
 | the next execution.  Both messages are in the k_msg queue for "a convenient | 
 | time to leave the kernel" (I'll have a better name for that later).  What we | 
 | do is execute the notification handler and jump to userspace.  Since there is | 
 | still a k_msg in the queue (and we self_ipi'd ourselves, it's part of how | 
 | k_msgs work), the IPI will fire and push us right back into the kernel to | 
 | execute the preemption, and the notif handler's context will be saved in the | 
 | vcore_ctx (ready to go when the vcore gets started again). | 
 |  | 
 | We could try to just leave the notif_pending flag set and ignore the message, | 
 | but that would involve inspecting the queue for the preempt k_msg. | 
 | Additionally, a preempt k_msg can arrive anyway.  Finally, it's possible to have | 
 | another message in the queue between the notif and the preempt, and it gets ugly | 
 | quickly trying to determine what to do. | 
 |  | 
 | 4.8: When a Pcore is "Free" | 
 | --------------------------- | 
 | There are a couple ways to handle pcores.  One approach would be to not | 
 | consider them free and able to be given to another process until the old | 
 | process is completely removed (abandon_core()).  Another approach is to free | 
 | the core once its fate is sealed (which we do).  This probably gives more | 
 | flexibility in schedule()-like functions (no need to wait to give the core | 
 | out), quicker dispatch latencies, less contention on shared structs (like the | 
 | idle-core-map), etc. | 
 |  | 
 | This 'freeing' of the pcore is from the perspective of the kernel scheduler | 
 | and the proc struct.  Contrary to all previous announcements, vcores are | 
 | unmapped from pcores when sending k_msgs (technically right after), while | 
 | holding the lock.  The pcore doesn't actually stop running the proc until the | 
 | kmsg completes and we abandon_core().  Previously, we used the vcoremap to | 
 | communicate to other cores in a lock-free manner, but that was pretty shitty | 
 | and now we just store the vcoreid in pcpu info. | 
 |  | 
 | Another tricky part is the seq_ctr used to signal userspace of changes to the | 
 | coremap or num_vcores (coremap_seqctr).  While we may not even need this in the | 
 | long run, it still seems like it could be useful.  The trickiness comes from | 
 | when we update the seq_ctr when we are unmapping vcores on the receive side of a | 
 | message (like __death or __preempt).  We'd rather not have each pcore contend on | 
 | the seq_ctr cache line (let alone any locking) while they perform a somewhat | 
 | data-parallel task.  So we continue to have the sending core handle the seq_ctr | 
 | upping and downing.  This works, since the "unlocking" happens after messages | 
 | are sent, which means the receiving core is no longer in userspace (if there is | 
 | a delay, it is because the remote core is in the kernel, possibly with | 
 | interrupts disabled).  Because of this, userspace will be unable to read the new | 
 | value of the seq_ctr before the IPI hits and does the unmapping that the seq_ctr | 
 | protects/advertises.  This is most likely true.  It wouldn't be if the "last IPI | 
 | was sent" flag clears before the IPI actually hit the other core. | 
 |  | 
 | 4.9: Future Broadcast/Messaging Needs | 
 | --------------------------- | 
 | Currently, messaging is serialized.  Broadcast IPIs exist, but the kernel | 
 | message system is based on adding a k_msg to a list in a pcore's | 
 | per_cpu_info.  Further, the sending of these messages is done in a loop.  In the | 
 | future, we would like to have broadcast messaging of some sort (literally a | 
 | broadcast, like the IPIs, and if not that, then a communication tree of | 
 | sorts).   | 
 |  | 
 | In the past, (OLD INFO): given those desires, we wanted to make sure that no | 
 | message we send needs details specific to a pcore (such as the vcoreid running | 
 | on it, a history number, or anything like that).  Thus no k_msg related to | 
 | process management would have anything that cannot apply to the entire | 
 | process.  At this point, most just have a struct proc *.  A pcore was able | 
 | to figure out what was happening based on the pcoremap, information in the | 
 | struct proc, and in the preempt struct in procdata. | 
 |  | 
 | In more recent revisions, the coremap no longer is meant to be used across | 
 | kmsgs, so some messages ('__startcore') send the vcoreid.  This means we can't | 
 | easily broadcast the message.  However, many broadcast mechanisms wouldn't | 
 | handle '__startcore' naturally.  For instance, logical IPIs need something | 
 | already set in the LAPIC, or maybe need to be sent to a somewhat predetermined | 
 | group (again, bits in the LAPIC).  If we tried this for '__startcore', we | 
 | could add something in to the messaging to carry these vcoreids.  More likely, | 
 | we'll have a broadcast tree.  Keeping vcoreid (or any arg) next to whoever | 
 | needs to receive the message is a very small amount of bookkeeping on a struct | 
 | that already does bookkeeping. | 
 |  | 
 | 4.10: Other Things We Thought of but Don't Like | 
 | --------------------------- | 
 | All local fate-related work is sent as a self k_msg, to enforce ordering. | 
 | It doesn't capture the difference between a local call and a remote k_msg. | 
 | The k_msg has already considered state and made its decision.  The local call | 
 | is an attempt.  It is also unnecessary, if we put in enough information to | 
 | make a decision in the proc struct.  Finally, it caused a few other problems | 
 | (like needing to detect arbitrary stale messages). | 
 |  | 
 | Overall message history: doesn't work well when you do per-core stuff, since | 
 | it will invalidate other messages for the process.  We then thought of a pcore | 
 | history counter to detect stale messages.  Don't like that either.  We'd have | 
 | to send the history in the message, since it's a per-message, per-core | 
 | expiration.  There might be other ways around this, but this doesn't seem | 
 | necessary. | 
 |  | 
 | Alarms have pointers to a list of which cores should be preempted when that | 
 | specific alarm goes off (saved with the alarm).  Ugh.  It gets ugly with | 
 | multiple outstanding preemptions and cores getting yielded while the alarms | 
 | sleep (and possibly could get reallocated, though we'd make a rule to prevent | 
 | that).  Like with notifications, being able to handle spurious alarms and | 
 | thinking of an alarm as just a prod to check somewhere is much more flexible | 
 | and simple.  It is similar to generic messages that have the actual important | 
 | information stored somewhere else (as with allowing broadcasts, with different | 
 | receivers performing slightly different operations). | 
 |  | 
 | Synchrony for messages (wanting a response to a preempt k_msg, for example) | 
 | sucks.  Just encode the state of impending fate in the proc struct, where it | 
 | belongs.  Additionally, we don't want to hold the proc lock even longer than | 
 | we do now (which is probably too long as it is).  Finally, it breaks a golden | 
 | rule: never wait while holding a lock: you will deadlock the system (e.g. if | 
 | the receiver is already in the kernel spinning on the lock).  We'd have to | 
 | send messages, unlock (which might cause a message to hit the calling pcore, | 
 | as in the case of locally called proc_destroy()), and in the meantime some | 
 | useful invariant might be broken. | 
 |  | 
 | We also considered using the transition stack as a signal that a process is in | 
 | a notification handler.  The kernel can inspect the stack pointer to determine | 
 | this.  It's possible, but unnecessary. | 
 |  | 
 | Using the pcoremap as a way to pass info with kmsgs: it worked a little, but | 
 | had some serious problems, as well as making life difficult.  It had two | 
 | purposes: help with local fate calls (yield) and allow broadcast messaging. | 
 | The main issue is that it was using a global struct to pass info with | 
 | messages, but it was based on the snapshot of state at the time the message | 
 | was sent.  When you send a bunch of messages, certain state may have changed | 
 | between messages, and the old snapshot isn't there anymore by the time the | 
 | message gets there.  To avoid this, we went through some hoops and had some | 
 | fragile code that would use other signals to avoid those scenarios where the | 
 | global state change would send the wrong message.  It was tough to understand, | 
 | and not clear it was correct (hint: it wasn't).  Here's an example (on one | 
 | pcore): if we send a preempt and we then try to map that pcore to another | 
 | vcore in the same process before the preempt call checks its pcoremap, we'll | 
 | clobber the pcore->vcore mapping (used by startcore) and the preempt will | 
 | think it is the new vcore, not the one it was when the message was sent. | 
 | While this is a bit convoluted, I can imagine a ksched doing this, and | 
 | perhaps with weird IRQ delays, the messages might get delayed enough for it to | 
 | happen.  I'd rather not have to have the ksched worry about this just because | 
 | proc code was old and ghetto.  Another reason we changed all of this was so | 
 | that you could trust the vcoremap while holding the lock.  Otherwise, it's | 
 | actually non-trivial to know the state of a vcore (need to check a combination | 
 | of preempt_served and is_mapped), and even if you do that, there are some | 
 | complications with doing this in the ksched. | 
 |  | 
 | 5. current_ctx and owning_proc | 
 | =========================== | 
 | Originally, current_tf was a per-core macro that returned a struct trapframe * | 
 | pointing back into the kernel stack at the user context that was running on | 
 | the given core when an interrupt or trap happened.  Saving the reference to | 
 | the TF helps simplify code that needs to do something with the TF (like save | 
 | it and pop another TF).  This way, we don't need to pass the context all over | 
 | the place, especially through code that might not care. | 
 |  | 
 | Then, current_tf was more broadly defined as the user context that should be | 
 | run when the kernel is ready to run a process.  In the older case, it was when | 
 | the kernel tries to return to userspace from a trap/interrupt.  current_tf | 
 | could be set by an IPI/KMSG (like '__startcore') so that when the kernel wants | 
 | to idle, it will find a current_tf that it needs to run, even though we never | 
 | trapped in on that context in the first place. | 
 |  | 
 | Finally, current_tf was changed to current_ctx, and instead of tracking a | 
 | struct trapframe (equivalent to a hw_trapframe), it now tracked a struct | 
 | user_context, which could be either a HW or a SW trapframe. | 
 |  | 
 | Further, we now have 'owning_proc', which tells the kernel which process | 
 | should be run.  'owning_proc' is a bigger deal than 'current_ctx', and it is | 
 | what tells us to run cur_ctx. | 
 |  | 
 | Process management KMSGs now simply modify 'owning_proc' and cur_ctx, as if we | 
 | had interrupted a process.  Instead of '__startcore' forcing the kernel to | 
 | actually run the process and trapframe, it will just mean we will eventually | 
 | run it.  In the meantime a '__notify' or a '__preempt' can come in, and they | 
 | will apply to the owning_proc/cur_ctx.  This greatly simplifies process code | 
 | and code calling process code (like the scheduler), since we no longer need to | 
 | worry about whether or not we are getting a "stack killing" kernel message. | 
 | Before this, code needed to care where it was running when managing _Ms. | 
 |  | 
 | Note that neither 'current_ctx' nor 'owning_proc' rely on 'current'/'cur_proc'. | 
 | 'current' is just what process context we're in, not what process (and which | 
 | trapframe) we will eventually run. | 
 |  | 
 | cur_ctx does not point to kernel trapframes, which is important when we | 
 | receive an interrupt in the kernel.  At one point, we were (hypothetically) | 
 | clobbering the reference to the user trapframe, and were unable to recover. | 
 | We can get away with this because the kernel always returns to its previous | 
 | context from a nested handler (via iret on x86).   | 
 |  | 
 | In the future, we may need to save kernel contexts and may not always return | 
 | via iret.  At which point, if the code path is deep enough that we don't want | 
 | to carry the TF pointer, we may revisit this.  Until then, current_ctx is just | 
 | for userspace contexts, and is simply stored in per_cpu_info. | 
 |  | 
 | Brief note from the future (months after this paragraph was written): cur_ctx | 
 | has two aspects/jobs: | 
 | 1) tell the kernel what we should do (trap, fault, sysc, etc), how we came | 
 | into the kernel (the fact that it is a user tf), which is why we copy-out | 
 | early on | 
 | 2) be a vehicle for us to restart the process/vcore | 
 |  | 
 | We've been focusing on the latter case a lot, since that is what gets | 
 | removed when preempted, changed during a notify, created during a startcore, | 
 | etc.  Don't forget it was also an instruction of sorts.  The former case is | 
 | always true throughout the life of the syscall.  The latter only happens to be | 
 | true throughout the life of a *non-blocking* trap since preempts are routine | 
 | KMSGs.  But if we block in a syscall, the cur_ctx is no longer the TF we came | 
 | in on (and possibly the one we are asked to operate on), and that old cur_ctx | 
 | has probably restarted. | 
 |  | 
 | (Note that cur_ctx is a pointer, and syscalls/traps actually operate on the TF | 
 | they came in on regardless of what happens to cur_ctx or pcpui->actual_tf.) | 
 |  | 
 | 6. Locking! | 
 | =========================== | 
 | 6.1: proc_lock | 
 | --------------------------- | 
 | Currently, all locking is done on the proc_lock.  Its main goal is to protect | 
 | the vcore mapping (vcore->pcore and vice versa).  As of Apr 2010, it's also used | 
 | to protect changes to the address space and the refcnt.  Eventually the refcnt | 
 | will be handled with atomics, and the address space will have its own MM lock. | 
 |  | 
 | We grab the proc_lock all over the place, but we try to avoid it wherever | 
 | possible - especially in kernel messages or other places that will be executed | 
 | in parallel.  One place we do grab it but would like to not is in proc_yield().   | 
 | We don't always need to grab the proc lock.  Here are some examples: | 
 |  | 
 | 6.1.1: Lockless Notifications: | 
 | ------------- | 
 | We don't lock when sending a notification.  We want the proc_lock to not be an | 
 | irqsave lock (discussed below).  Since we might want to send a notification from | 
 | interrupt context, we can't grab the proc_lock if it's a regular lock.   | 
 |  | 
 | This is okay, since the proc_lock is only protecting the vcoremapping.  We could | 
 | accidentally send the notification to the wrong pcore.  The __notif handler | 
 | checks to make sure it is the right process, and all _M processes should be able | 
 | to handle spurious notifications.  This assumes they are still _M. | 
 |  | 
 | If we send it to the wrong pcore, there is a danger of losing the notif, since | 
 | it didn't go to the correct vcore.  That would happen anyway (the vcore is | 
 | unmapped, or in the process of mapping).  The notif_pending flag will be caught | 
 | when the vcore is started up next time (and that flag was set before reading the | 
 | vcoremap). | 
 |  | 
 | 6.1.2: Local get_vcoreid(): | 
 | ------------- | 
 | It's not necessary to lock while checking the vcoremap if you are checking for | 
 | the core you are running on (e.g. pcoreid == core_id()).  This is because all | 
 | unmappings of a vcore are done on the receive side of a routine kmsg, and that | 
 | code cannot run concurrently with the code you are running.   | 
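 |  | 
 | A sketch of that lockless lookup (the map layout and field names are | 
 | approximate): | 
 |  | 
 |     /* No proc_lock needed: any unmapping of the vcore running *here* is | 
 |      * done by a routine kmsg on this core, which cannot run concurrently | 
 |      * with us. */ | 
 |     static uint32_t get_vcoreid_local(struct proc *p) | 
 |     { | 
 |         uint32_t pcoreid = core_id(); | 
 |  | 
 |         assert(p->procinfo->pcoremap[pcoreid].valid); | 
 |         return p->procinfo->pcoremap[pcoreid].vcoreid; | 
 |     } | 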
 |  | 
 | 6.2: irqsave | 
 | --------------------------- | 
 | The proc_lock used to be an irqsave lock (meaning it disables interrupts and can | 
 | be grabbed from interrupt context).  We made it a regular lock for a couple | 
 | reasons.  The immediate one was it was causing deadlocks due to some other | 
 | ghetto things (blocking on the frontend server, for instance).  More generally, | 
 | we don't want to disable interrupts for long periods of time, so it was | 
 | something worth doing anyway.   | 
 |  | 
 | This means that we cannot grab the proc_lock from interrupt context.  This | 
 | includes having schedule called from an interrupt handler (like the | 
 | timer_interrupt() handler), since it will call proc_run.  Right now, we actually | 
 | do this, which we shouldn't, and that will eventually get fixed.  The right | 
 | answer is that the actual work of running the scheduler should be a routine | 
 | kmsg, similar to how Linux sets a bit in the kernel that it checks on the way | 
 | out to see if it should run the scheduler or not. | 
 |  | 
 | 7. TLB Coherency | 
 | =========================== | 
 | When changing or removing memory mappings, we need to do some form of a TLB | 
 | shootdown.  Normally, this will require sending an IPI (immediate kmsg) to | 
 | every vcore of a process to unmap the affected page.  Before allocating that | 
 | page back out, we need to make sure that every TLB has been flushed.   | 
 |  | 
 | One reason to use a kmsg over a simple handler is that we often want to pass a | 
 | virtual address to flush for those architectures (like x86) that can | 
 | invalidate a specific page.  Ideally, we'd use a broadcast kmsg (doesn't exist | 
 | yet), though we already have simple broadcast IPIs. | 
 |  | 
 | 7.1 Initial Stuff | 
 | --------------------------- | 
 | One big issue is whether or not to wait for a response from the other vcores | 
 | that they have unmapped.  There are two concerns: 1) Page reuse and 2) User | 
 | semantics.  We cannot give out the physical page while it may still be in a | 
 | TLB (even to the same process.  Ask us about the pthread_test bug). | 
 |  | 
 | The second case is a little more detailed.  The application may not like it if | 
 | it thinks a page is unmapped or protected, and it does not generate a fault. | 
 | I am less concerned about this, especially since we know that even if we don't | 
 | wait to hear from every vcore, we know that the message was delivered and the | 
 | IPI sent.  Any cores that are in userspace will have trapped and eventually | 
 | handle the shootdown before having a chance to execute other user code.  The | 
 | delays in the shootdown response are due to being in the kernel with | 
 | interrupts disabled (it was an IMMEDIATE kmsg). | 
 |  | 
 | 7.2 RCU | 
 | --------------------------- | 
 | One approach is similar to RCU.  Unmap the page, but don't put it on the free | 
 | list.  Instead, don't reallocate it until we are sure every core (possibly | 
 | just affected cores) had a chance to run its kmsg handlers.  This time is | 
 | similar to the RCU grace periods.  Once the period is over, we can then truly | 
 | free the page. | 
 |  | 
 | This would require some sort of RCU-like mechanism and probably a per-core | 
 | variable that has the timestamp of the last quiescent period.  Code caring | 
 | about when this page (or pages) can be freed would have to check on all of the | 
 | cores (probably in a bitmask for what needs to be freed).  It would make sense | 
 | to amortize this over several RCU-like operations. | 
 |  | 
 | 7.3 Checklist | 
 | --------------------------- | 
 | It might not suck that much to wait for a response if you already sent an IPI, | 
 | though it incurs some more cache misses.  If you wanted to ensure all vcores | 
 | ran the shootdown handler, you'd have them all toggle their bit in a checklist | 
 | (unused for a while, check smp.c).  The only one who waits would be the | 
 | caller, but there still are a bunch of cache misses in the handlers.  Maybe | 
 | this isn't that big of a deal, and the RCU thing is an unnecessary | 
 | optimization. | 
 |  | 
 | 7.4 Just Wait til a Context Switch | 
 | --------------------------- | 
 | Another option is to not bother freeing the page until the entire process is | 
 | descheduled.  This could be a very long time, and also will mess with | 
 | userspace's semantics.  They would be running user code that could still | 
 | access the old page, so in essence this is a lazy munmap/mprotect.  The | 
 | process basically has the page in purgatory: it can't be reallocated, and it | 
 | might be accessible, but can't be guaranteed to work. | 
 |  | 
 | The main benefit of this is that you don't need to send the TLB shootdown IPI | 
 | at all - so you don't interfere with the app.  Though in return, they have | 
 | possibly weird semantics.  One aspect of these weird semantics is that the | 
 | same virtual address could map to two different pages - that seems like a | 
 | disaster waiting to happen.  We could also block that range of the virtual | 
 | address space from being reallocated, but that gets even more tricky. | 
 |  | 
| One issue with both the just-wait and the RCU approaches is memory pressure. | 
| If we actually need the page, we will need to enforce an unmapping, which | 
| sucks a little. | 
 |  | 
 | 7.5 Bulk vs Single | 
 | --------------------------- | 
 | If there are a lot of pages being shot down, it'd be best to amortize the cost | 
 | of the kernel messages, as well as the invlpg calls (single page shootdowns). | 
 | One option would be for the kmsg to take a range, and not just a single | 
 | address.  This would help with bulk munmap/mprotects.  Based on the number of | 
| these, perhaps a raw tlbflush (the entire TLB) would be worthwhile, instead | 
 | of n single shots.  Odds are, that number is arch and possibly workload | 
 | specific. | 
 |  | 
 | For now, the plan will be to send a range and have them individually shot | 
 | down. | 
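|  | 
| A sketch of the range-based handler, with a made-up crossover point for | 
| falling back to a full flush: | 
|  | 
|     /* Sketch: shoot down [start, end), falling back to a full TLB flush | 
|      * for large ranges.  The threshold is arch/workload specific. */ | 
|     #define SHOOTDOWN_FULL_FLUSH_PAGES 32   /* made-up number */ | 
|  | 
|     static void shootdown_range(uintptr_t start, uintptr_t end) | 
|     { | 
|         size_t nr_pages = (end - start) >> PGSHIFT; | 
|  | 
|         if (nr_pages > SHOOTDOWN_FULL_FLUSH_PAGES) { | 
|             tlbflush();         /* flush the entire TLB */ | 
|             return; | 
|         } | 
|         for (uintptr_t va = start; va < end; va += PGSIZE) | 
|             invlpg((void*)va);  /* per-page invalidation */ | 
|     } | 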
 |  | 
 | 7.6 Don't do it | 
 | --------------------------- | 
 | Either way, munmap/mprotect sucks in an MCP.  I recommend not doing it, and | 
 | doing the appropriate mmap/munmap/mprotects in _S mode.  Unfortunately, even | 
 | our crap pthread library munmaps on demand as threads are created and | 
| destroyed.  The vcore code probably does too, in the bowels of glibc's TLS | 
| code, though at least that isn't on every user context switch. | 
 |  | 
 | 7.7 Local memory | 
 | --------------------------- | 
 | Private local memory would help with this too.  If each vcore has its own | 
 | range, we won't need to send TLB shootdowns for those areas, and we won't have | 
 | to worry about weird application semantics.  The downside is we would need to | 
 | do these mmaps in certain ranges in advance, and might not easily be able to | 
 | do them remotely.  More on this when we actually design and build it. | 
 |  | 
 | 7.8 Future Hardware Support | 
 | --------------------------- | 
 | It would be cool and interesting if we had the ability to remotely shootdown | 
 | TLBs.  For instance, all cores with cr3 == X, shootdown range Y..Z.  It's | 
 | basically what we'll do with the kernel message and the vcoremap, but with | 
 | magic hardware. | 
 |  | 
 | 7.9 Current Status | 
 | --------------------------- | 
 | For now, we just send a kernel message to all vcores to do a full TLB flush, | 
 | and not to worry about checklists, waiting, or anything.  This is due to being | 
| short on time and not wanting to sort out the issue with ranges.  The eventual | 
| change will be to send the kmsg with the range to the appropriate cores, and | 
| then maybe put the page at the end of the freelist (instead of the | 
 | head).  More to come. | 
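|  | 
| In code form, the current approach is roughly the following (a sketch; names | 
| and fields are approximate, and the range is currently ignored): | 
|  | 
|     /* Sketch: send an immediate kmsg to every online vcore's pcore; the | 
|      * handler just flushes the entire TLB.  No waiting, no checklists. */ | 
|     static void __tlb_flush_handler(uint32_t srcid, long a0, long a1, long a2) | 
|     { | 
|         tlbflush(); | 
|     } | 
|  | 
|     void proc_tlbshootdown(struct proc *p, uintptr_t start, uintptr_t end) | 
|     { | 
|         struct vcore *vc_i; | 
|  | 
|         /* start/end are unused for now, as discussed above */ | 
|         spin_lock(&p->proc_lock); | 
|         TAILQ_FOREACH(vc_i, &p->online_vcs, list) | 
|             send_kernel_message(vc_i->pcoreid, __tlb_flush_handler, | 
|                                 0, 0, 0, KMSG_IMMEDIATE); | 
|         spin_unlock(&p->proc_lock); | 
|     } | 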
 |  | 
 | 8. Process Management | 
 | =========================== | 
 | 8.1 Vcore lists | 
 | --------------------------- | 
 | We have three lists to track a process's vcores.  The vcores themselves sit in | 
| the vcoremap in procinfo; they aren't dynamically allocated or anything like | 
| that.  The lists greatly ease vcore discovery and management. | 
 |  | 
 | A vcore is on exactly one of three lists: online (mapped and running vcores, | 
 | sometimes called 'active'), bulk_preempt (was online when the process was bulk | 
 | preempted (like a timeslice)), and inactive (yielded, hasn't come on yet, | 
 | etc).  When writes are complete (unlocked), either the online list or the | 
 | bulk_preempt list should be empty. | 
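|  | 
| Roughly, the structures look like the following sketch (field names are | 
| approximate): | 
|  | 
|     /* Sketch: each vcore has a single list hook; which list it is on | 
|      * encodes its state.  List changes happen with the proc_lock held. */ | 
|     #include <sys/queue.h> | 
|  | 
|     struct vcore { | 
|         TAILQ_ENTRY(vcore) list;    /* hook for exactly one list */ | 
|         uint32_t pcoreid;           /* valid iff mapped */ | 
|         bool valid;                 /* mapped to a pcore */ | 
|     }; | 
|     TAILQ_HEAD(vcore_tailq, vcore); | 
|  | 
|     struct vcore_lists {            /* these live in struct proc */ | 
|         struct vcore_tailq online_vcs;          /* mapped and running */ | 
|         struct vcore_tailq bulk_preempted_vcs;  /* preempted as a group */ | 
|         struct vcore_tailq inactive_vcs;        /* yielded / never started */ | 
|     }; | 
|  | 
|     /* e.g., with the proc_lock held, bringing a yielded vcore online: */ | 
|     static void vcore_go_online(struct vcore_lists *vcl, struct vcore *vc) | 
|     { | 
|         TAILQ_REMOVE(&vcl->inactive_vcs, vc, list); | 
|         TAILQ_INSERT_TAIL(&vcl->online_vcs, vc, list); | 
|     } | 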
 |  | 
 | List modifications are protected by the proc_lock.  You can concurrently read, | 
 | but note you may get some weird behavior, such as a vcore on multiple lists, a | 
 | vcore on no lists, online and bulk_preempt both having items, etc.  Currently, | 
 | event code will read these lists when hunting for a suitable core, and will | 
| have to be careful about races.  I want to keep event FALLBACK code from | 
| grabbing the proc_lock. | 
 |  | 
 | Another slight thing to be careful of is that the vcore lists don't always | 
| agree with the vcore mapping.  However, they will always agree with what the | 
| state of the process will be once all kmsgs are processed (its fate). | 
 | Specifically, when we take vcores, the unmapping happens with the lock not | 
 | held on the vcore itself (as discussed elsewhere).  The vcore lists represent | 
 | the result of those pending unmaps. | 
 |  | 
 | Before we used the lists, we scanned the vcoremap in a painful, clunky manner. | 
 | In the old style, when you asked for a vcore, the first one you got was the | 
 | first hole in the vcoremap.  Ex: Vcore0 would always be granted if it was | 
 | offline.  That's no longer true; the most recent vcore yielded will be given | 
 | out next.  This will help with cache locality, and also cuts down on the | 
| scenarios in which the kernel gives out a vcore that userspace wasn't | 
 | expecting.  This can still happen if they ask for more vcores than they set up | 
 | for, or if a vcore doesn't *want* to come online (there's a couple scenarios | 
 | with preemption recovery where that may come up). | 
 |  | 
 | So the plan with the bulk preempt list is that vcores on it were preempted, | 
 | and the kernel will attempt to restart all of them (and move them to the online | 
 | list).  Any leftovers will be moved to the inactive list, and have preemption | 
 | recovery messages sent out.  Any shortages (they want more vcores than were | 
| bulk_preempted) will be taken from the inactive (yielded) list.  This all | 
| means that whether or not a vcore needs to be preempt-recovered, or whether | 
| there is a message out about its preemption, doesn't really affect which list | 
| it is on.  You could | 
 | have a vcore on the inactive list that was bulk preempted (and not turned back | 
 | on), and then that vcore gets granted in the next round of vcore_requests(). | 
 | The preemption recovery handlers will need to deal with concurrent handlers | 
 | and the vcore itself starting back up. | 
 |  | 
 | 9. On the Ordering of Messages and Bugs with Old State | 
 | =========================== | 
 | This is a sordid tale involving message ordering, message delivery times, and | 
 | finding out (sometimes too late) that the state you expected is gone and | 
 | having to deal with that error. | 
 |  | 
 | A few design issues: | 
 | - being able to send messages and have them execute in the order they are | 
 |   sent | 
| - having message handlers resolve issues with global state.  Some need to know | 
|   the correct 'world view', and others need to know what the state was at the | 
|   time they were sent. | 
| - realizing that syscalls, traps, faults, and any other non-IRQ entry into the | 
|   kernel are really messages. | 
 |  | 
 | Process management messages have alternated from ROUTINE to IMMEDIATE and now | 
 | back to ROUTINE.  These messages include such family favorites as | 
 | '__startcore', '__preempt', etc.  Meanwhile, syscalls were coming in that | 
 | needed to know about the core and the process's state (specifically, yield, | 
 | change_to, and get_vcoreid).  Finally, we wanted to avoid locking, esp in | 
| KMSG handlers (imagine all cores grabbing the lock to check the vcoremap or | 
 | something). | 
 |  | 
| Incidentally, events were being delivered concurrently to vcores, though that | 
 | actually didn't matter much (check out async_events.txt for more on that). | 
 |  | 
 | 9.1: Design Guidelines | 
 | --------------------------- | 
 | Initially, we wanted to keep broadcast messaging available as an option.  As | 
 | noted elsewhere, we can't really do this well for startcore, since most | 
 | hardware broadcast options need some initial per-core setup, and any sort of | 
 | broadcast tree we make should be able to handle a small message.  Anyway, this | 
| desire in the early code to keep all messages identical led to a few | 
 | problems. | 
 |  | 
 | Another objective of the kernel messaging was to avoid having the message | 
 | handlers grab any locks, especially the same lock (the proc lock is used to | 
 | protect the vcore map, for instance). | 
 |  | 
 | Later on, a few needs popped up that motivated the changes discussed below: | 
 | - Being able to find out which proc/vcore was on a pcore | 
 | - Not having syscalls/traps require crazy logic if the carpet was pulled out | 
 |   from under them. | 
 | - Having proc management calls return.  This one was sorted out by making all | 
 |   kmsg handlers return.  It would be a nightmare making a ksched without this. | 
 |  | 
 | 9.2: Looking at Old State: a New Bug for an Old Problem | 
 | --------------------------- | 
| We've always had issues with syscalls coming in after the fate of a core had | 
| already been determined.  This is referred to in a few places as "predetermined | 
| fate" vs "local state".  A remote lock holder (ksched) already determined a core | 
 | should be unmapped and sent a message.  Only later does some call like | 
 | proc_yield() realize its core is already *unmapped*. (I use that term poorly | 
 | here).  This sort of code had to realize it was working on an old version of | 
 | state and just abort.  This was usually safe, though looking at the vcoremap | 
 | was a bad idea.  Initially, we used preempt_served as the signal, which was | 
 | okay.  Around 12b06586 yield started to use the vcoremap, which turned out to | 
 | be wrong. | 
 |  | 
 | A similar issue happens for the vcore messages (startcore, preempt, etc).  The | 
 | way startcore used to work was that it would only know what pcore it was on, | 
 | and then look into the vcoremap to figure out what vcoreid it should be | 
 | running.  This was to keep broadcast messaging available as an option.  The | 
 | problem with it is that the vcoremap may have changed between when the | 
 | messages were sent and when they were executed.  Imagine a startcore followed | 
| by a preempt, after which the vcore was unmapped.  Well, to get around that, we | 
 | had the unmapping happen in the preempt or death handlers.  Yikes!  This was | 
 | the case back in the early days of ROS.  This meant the vcoremap wasn't | 
 | actually representative of the decisions the ksched made - we also needed to | 
 | look at the state we'd have after all outstanding messages executed.  And this | 
 | would differ from the vcore lists (which were correct for a lock holder). | 
 |  | 
| This was manageable for a little while, until I tried to conclusively know who | 
 | owned a particular pcore.  This came up while making a provisioning scheduler. | 
 | Given a pcore, tell me which process/vcore (if any) were on it.  It was rather | 
 | tough.  Getting the proc wasn't too hard, but knowing which vcore was a little | 
 | tougher.  (Note the ksched doesn't care about which vcore is running, and the | 
 | process can change vcores on a pcore at will).  But once you start looking at | 
 | the process, you can't tell which vcore a certain pcore has.  The vcoremap may | 
 | be wrong, since a preempt is already on the way.  You would have had to scan | 
 | the vcore lists to see if the proc code thought that vcore was online or not | 
 | (which would mean there had been no preempts).  This is the pain I was talking | 
 | about back around commit 5343a74e0. | 
 |  | 
 | So I changed things so that the vcoremap was always correct for lock holders, | 
 | and used pcpui to track owning_vcoreid (for preempt/notify), and used an extra | 
 | KMSG variable to tell startcore which vcoreid it should use.  In doing so, we | 
 | (re)created the issue that the delayed unmapping dealt with: the vcoremap | 
 | would represent *now*, and not the vcoremap of when the messages were first | 
 | sent.  However, this had little to do with the KMSGs, which I was originally | 
 | worried about.  No one was looking at the vcoremap without the lock, so the | 
 | KMSGs were okay, but remember: syscalls are like messages too.  They needed to | 
 | figure out what vcore they were on, i.e. what vcore userspace was making | 
 | requests on (viewing a trap/fault as a type of request). | 
 |  | 
 | Now the problem was that we were using the vcoremap to figure out which vcore | 
 | we were supposed to be.  When a syscall finally ran, the vcoremap could be | 
 | completely wrong, and with immediate KMSGs (discussed below), the pcpui was | 
 | already changed!  We dealt with the problem for KMSGs, but not syscalls, and | 
 | basically reintroduced the bug of looking at current state and thinking it | 
 | represented the state from when the 'message' was sent (when we trapped into | 
 | the kernel, for a syscall/exception). | 
 |  | 
 | 9.3: Message Delivery, Circular Waiting, and Having the Carpet Pulled Out | 
 | --------------------------- | 
 | In-order message delivery was what drove me to build the kernel messaging | 
 | system in the first place.  It provides in-order messages to a particular | 
 | pcore.  This was enough for a few scenarios, such as preempts racing ahead of | 
| startcores, or deaths racing ahead of preempts, etc.  However, I also wanted | 
 | an ordering of messages related to a particular vcore, and this wasn't | 
 | apparent early on. | 
 |  | 
| The issue first popped up with a startcore coming quickly on the heels of a | 
| preempt for the same VC, but on different PCs.  The startcore cannot proceed | 
| until the preempt has saved the TF into the VCPD.  The old way of dealing with | 
 | this was to spin in '__map_vcore()'.  This was problematic, since it meant we | 
 | were spinning while holding a lock, and resulted in some minor bugs and issues | 
 | with lock ordering and IRQ disabling (couldn't disable IRQs and then try to | 
 | grab the lock, since the lock holder could have sent you a message and is | 
 | waiting for you to handle the IRQ/IMMED KMSG).  However, it was doable.  But | 
 | what wasn't doable was to have the KMSGs be ROUTINE.  Any syscalls that tried | 
 | to grab the proc lock (lots of them) would deadlock, since the lock holder was | 
 | waiting on us to handle the preempt (same circular waiting issue as above). | 
 |  | 
 | This was fine, albeit subpar, until a new issue showed up.  Sending IMMED | 
 | KMSGs worked fine if we were coming from userspace already, but if we were in | 
 | the kernel, those messages would run immediately (hence the name), just like | 
 | an IRQ handler, and could confuse syscalls that touched cur_ctx/pcpui.  If a | 
 | preempt came in during a syscall, the process/vcore could be changed before | 
 | the syscall took place.  Some syscalls could handle this, albeit poorly. | 
 | sys_proc_yield() and sys_change_vcore() delicately tried to detect if they | 
 | were still mapped or not and use that to determine if a preemption happened. | 
 |  | 
 | As mentioned above, looking at the vcoremap only tells you what is currently | 
 | happening, and not what happened in the past.  Specifically, it doesn't tell | 
 | you the state of the mapping when a particular core trapped into the kernel | 
 | for a syscall (referred to as when the 'message' was sent up above).  Imagine | 
 | sys_get_vcoreid(): you trap in, then immediately get preempted, then startcore | 
 | for the same process but a different vcoreid.  The syscall would return with | 
 | the vcoreid of the new vcore, since it cannot tell there was a change.  The | 
 | async syscall would complete and we'd have a wrong answer.  While this never | 
 | happened to me, I had a similar issue while debugging some other bugs (I'd get | 
 | a vcoreid of 0xdeadbeef, for instance, which was the old poison value for an | 
 | unmapped vcoreid).  There are a bunch of other scenarios that trigger similar | 
 | disasters, and they are very hard to avoid. | 
 |  | 
 | One way out of this was a per-core history counter, that changed whenever we | 
 | changed cur_ctx.  Then when we trapped in for a syscall, we could save the | 
 | value, enable_irqs(), and go about our business.  Later on, we'd have to | 
 | disable_irqs() and compare the counters.  If they were different, we'd have to | 
| bail out somehow.  This could have worked for change_to and yield, and some | 
 | others.  But any syscall that wanted to operate on cur_ctx in some way would | 
 | fail (imagine a hypothetical sys_change_stack_pointer()).  The context that | 
 | trapped has already returned on another core.  I guess we could just fail that | 
 | syscall, though it seems a little silly to not be able to do that. | 
 |  | 
| The previous example was a bit contrived, but let's also remember that it isn't | 
 | just syscalls: all exceptions have the same issue.  Faults might be fixable, | 
 | since if you restart a faulting context, it will start on the faulting | 
 | instruction.  However all traps (like syscall) restart on the next | 
 | instruction.  Hope we don't want to do anything fancy with breakpoint!  Note | 
 | that I had breakpointing contexts restart on other pcores and continue while I | 
 | was in the breakpoint handler (noticed while I was debugging some bugs with | 
 | lots of preempts).  Yikes.  And don't forget we eventually want to do some | 
 | complicated things with the page fault handler, and may want to turn on | 
| interrupts / kthread during a page fault (imagine hitting disk).  Yikes. | 
 |  | 
 | So I looked into going back to ROUTINE kernel messages.  With ROUTINE | 
 | messages, I didn't have to worry about having the carpet pulled out from under | 
 | syscalls and exceptions (traps, faults, etc).  The 'carpet' is stuff like | 
 | cur_ctx, owning_proc, owning_vcoreid, etc.  We still cannot trust the vcoremap, | 
 | unless we *know* there were no preempts or other KMSGs waiting for us. | 
 | (Incidentally, in the recent fix a93aa7559, we merely use the vcoremap as a | 
 | sanity check). | 
 |  | 
 | However, we can't just switch back to ROUTINEs.  Remember: with ROUTINEs, | 
 | we will deadlock in '__map_vcore()', when it waits for the completion of | 
 | preempt.  Ideally, we would have had startcore spin on the signal.  Since we | 
 | already gave up on using x86-style broadcast IPIs for startcore (in | 
 | 5343a74e0), we might as well pass along a history counter, so it knows to wait | 
 | on preempt. | 
 |  | 
 | 9.4: The Solution | 
 | --------------------------- | 
 | To fix up all of this, we now detect preemptions in syscalls/traps and order | 
 | our kernel messages with two simple per-vcore counters.  Whenever we send a | 
 | preempt, we up one counter.  Whenever that preempt finishes, it ups another | 
 | counter.  When we send out startcores, we send a copy of the first counter. | 
 | This is a way of telling startcore where it belongs in the list of messages. | 
 | More specifically, it tells it which preempt happens-before it. | 
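|  | 
| Here's a minimal sketch of the counters and how a startcore could use them | 
| (names are approximate; the real counters live with the rest of the per-vcore | 
| state): | 
|  | 
|     /* Sketch: per-vcore counters ordering preempts and startcores. | 
|      * nr_preempts_sent is bumped under the proc_lock when a __preempt is | 
|      * sent; nr_preempts_done is bumped by the __preempt handler once it | 
|      * has finished saving state. */ | 
|     struct preempt_ctrs { | 
|         atomic_t nr_preempts_sent; | 
|         atomic_t nr_preempts_done; | 
|     }; | 
|  | 
|     /* A __startcore carries the value of nr_preempts_sent from send time | 
|      * and spins until that many preempts have completed before it touches | 
|      * the VCPD. */ | 
|     static void wait_for_prior_preempts(struct preempt_ctrs *ctrs, | 
|                                         long nr_sent_at_send_time) | 
|     { | 
|         while (atomic_read(&ctrs->nr_preempts_done) < nr_sent_at_send_time) | 
|             cpu_relax(); | 
|     } | 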
 |  | 
 | Basically, I wanted a partial ordering on my messages, so that messages sent | 
 | to a particular vcore are handled in the order they were sent, even if those | 
 | messages run on different physical cores. | 
 |  | 
 | It is not sufficient to use a seq counter (one integer, odd values for | 
 | 'preempt in progress' and even values for 'preempt done').  It is possible to | 
 | have multiple preempts in flight for the same vcore, albeit with startcores in | 
 | between.  Still, there's no way to encode that scenario in just one counter. | 
 |  | 
 | Here's a normal example of traffic to some vcore.  I note both the sending and | 
 | the execution of the kmsgs: | 
 |    nr_pre_sent    nr_pre_done    pcore     message sent/status | 
 |    ------------------------------------------------------------- | 
 |    0              0              X         startcore (nr_pre_sent == 0) | 
 |    0              0              X         startcore (executes) | 
 |    1              0              X         preempt   (kmsg sent) | 
 |    1              1              Y         preempt   (executes) | 
 |    1              1              Y         startcore (nr_pre_sent == 1) | 
 |    1              1              Y         startcore (executes) | 
 |  | 
 | Note the messages are always sent by the lockholder in the order of the | 
 | example above. | 
 |  | 
 | Here's when the startcore gets ahead of the prior preempt: | 
 |    nr_pre_sent    nr_pre_done    pcore     message sent/status | 
 |    ------------------------------------------------------------- | 
 |    0              0              X         startcore (nr_pre_sent == 0)  | 
 |    0              0              X         startcore (executes) | 
 |    1              0              X         preempt   (kmsg sent) | 
 |    1              0              Y         startcore (nr_pre_sent == 1) | 
 |    1              1              X         preempt   (executes) | 
 |    1              1              Y         startcore (executes) | 
 |  | 
 | Note that this can only happen across cores, since KMSGs to a particular core | 
 | are handled in order (for a given class of message).  The startcore blocks on | 
 | the prior preempt. | 
 |  | 
 | Finally, here's an example of what a seq ctr can't handle: | 
 |    nr_pre_sent    nr_pre_done    pcore     message sent/status | 
 |    ------------------------------------------------------------- | 
 |    0              0              X         startcore (nr_pre_sent == 0)  | 
 |    1              0              X         preempt   (kmsg sent) | 
 |    1              0              Y         startcore (nr_pre_sent == 1) | 
 |    2              0              Y         preempt   (kmsg sent) | 
 |    2              0              Z         startcore (nr_pre_sent == 2) | 
 |    2              1              X         preempt   (executes (upped to 1)) | 
 |    2              1              Y         startcore (executes (needed 1)) | 
 |    2              2              Y         preempt   (executes (upped to 2)) | 
|    2              2              Z         startcore (executes (needed 2)) | 
 |  | 
 | As a nice bonus, it is easy for syscalls that care about the vcoreid (yield, | 
 | change_to, get_vcoreid) to check if they have a preempt_served.  Just grab the | 
 | lock (to prevent further messages being sent), then check the counters.  If | 
 | they are equal, there is no preempt on its way.  This actually was the | 
 | original way we checked for preempts in proc_yield back in the day.  It was | 
 | just called preempt_served.  Now, it is split into two counters, instead of | 
 | just being a bool.   | 
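|  | 
| Reusing the hypothetical counters from the sketch above, the check is just | 
| the following (done with the proc_lock held, so no new preempts can be sent | 
| while we look): | 
|  | 
|     /* Sketch: equal counters mean no preempt is in flight for this vcore; | 
|      * unequal means one was sent but hasn't finished (the old | 
|      * preempt_served condition). */ | 
|     static bool preempt_is_pending(struct preempt_ctrs *ctrs) | 
|     { | 
|         return atomic_read(&ctrs->nr_preempts_sent) != | 
|                atomic_read(&ctrs->nr_preempts_done); | 
|     } | 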
 |  | 
 | Regardless of whether or not we were preempted, we still can look at | 
 | pcpui->owning_proc and owning_vcoreid to figure out what the vcoreid of the | 
 | trap/syscall is, and we know that the cur_ctx is still the correct cur_ctx (no | 
 | carpet pulled out), since while there could be a preempt ROUTINE message | 
 | waiting for us, we simply haven't run it yet.  So calls like yield should | 
 | still fail (since your core has been unmapped and you need to bail out and run | 
 | the preempt handler), but calls like sys_change_stack_pointer can proceed. | 
 | More importantly than that old joke syscall, the page fault handler can try to | 
 | do some cool things without worrying about really crazy stuff. | 
 |  | 
 | 9.5: Why We (probably) Don't Deadlock | 
 | --------------------------- | 
| It's worth thinking about why this setup of preempts and startcores can't | 
| deadlock.  Anytime we spin in the kernel, we ought to think this through. | 
| Perhaps there is some issue with other KMSGs for other processes, or other | 
| vcores, or something like that, which could cause a deadlock. | 
 |  | 
| Hypothetical case: pcore 1 has a startcore for vc1, which waits on vc1's | 
| preempt; that preempt is queued behind vc2's startcore on PC2, which in turn | 
| waits on vc2's preempt, queued behind vc1's startcore on PC1.  Time goes | 
| upwards in the diagram.  In these examples, startcores are waiting on | 
| particular preempts, subject to the nr_preempts_sent parameter sent along | 
| with the startcores. | 
 |  | 
 | ^                        | 
 | |            _________                 _________ | 
 | |           |         |               |         | | 
 | |           | pr vc 2 |               | pr vc 1 | | 
 | |           |_________|               |_________| | 
 | | | 
 | |            _________                 _________ | 
 | |           |         |               |         | | 
 | |           | sc vc 1 |               | sc vc 2 | | 
 | |           |_________|               |_________| | 
 | t            | 
 | --------------------------------------------------------------------------- | 
 |               ______                    ______ | 
 |              |      |                  |      | | 
 |              | PC 1 |                  | PC 2 | | 
 |              |______|                  |______| | 
 |  | 
| Here's the same picture, but with certain happens-before arrows.  We'll use | 
| X --> Y to mean X happened before Y (e.g., X was sent before Y).  For | 
| instance, a startcore is sent after the preempt it follows. | 
 |  | 
 | ^                        | 
 | |            _________                 _________ | 
 | |           |         |               |         | | 
 | |       .-> | pr vc 2 | --.    .----- | pr vc 1 | <-. | 
 | |       |   |_________|    \  /   &   |_________|   |   | 
 | |     * |                   \/                      | *  | 
 | |       |    _________      /\         _________    |   | 
 | |       |   |         |    /  \   &   |         |   |   | 
 | |       '-- | sc vc 1 | <-'    '----> | sc vc 2 | --' | 
 | |           |_________|               |_________| | 
 | t            | 
 | --------------------------------------------------------------------------- | 
 |               ______                    ______ | 
 |              |      |                  |      | | 
 |              | PC 1 |                  | PC 2 | | 
 |              |______|                  |______| | 
 |  | 
 | The arrows marked with * are ordered like that due to the property of KMSGs, | 
 | in that we have in order delivery.  Messages are executed in the order in | 
 | which they were sent (serialized with a spinlock btw), so on any pcore, | 
 | messages that are further ahead in the queue were sent before (and thus will | 
 | be run before) other messages. | 
 |  | 
 | The arrows marked with a & are ordered like that due to how the proc | 
 | management code works.  The kernel won't send out a startcore for a particular | 
| vcore before it sent out a preempt.  (Note that technically, preempts follow | 
 | startcores.  The startcores in this example are when we start up a vcore after | 
| it had been preempted in the past.) | 
 |  | 
 | Anyway, note that we have a cycle, where all events happened before each | 
 | other, which isn't possible.  The trick to connecting "unrelated" events like | 
 | this (unrelated meaning 'not about the same vcore') in a happens-before manner | 
 | is the in-order properties of the KMSGs. | 
 |  | 
 | Based on this example, we can derive general rules.  Note that 'sc vc 2' could | 
 | be any kmsg that waits on another message placed behind 'sc vc 1'.  This would | 
| require us having sent a KMSG that waits on a KMSG that we send later.  Bad | 
| idea!  (You could have sent that KMSG to yourself, aside from it just being | 
| dangerous.)  If you want to spin, make sure you send the work that should | 
 | happen-before actually-before the waiter. | 
 |  | 
 | In fact, we don't even need 'sc vc 2' to be a KMSG.  It could be miscellaneous | 
 | kernel code, like a proc mgmt syscall.  Imagine if we did something like the | 
 | old '__map_vcore' call from within the ksched.  That would be code that holds | 
 | the lock, and then waits on the execution of a message handler.  That would | 
 | deadlock (which is why we don't do it anymore). | 
 |  | 
 | Finally, in case this isn't clear, all of the startcores and preempts for | 
 | a given vcore exist in a happens-before relation, both in sending and in | 
 | execution.  The sending aspect is handled by proc mgmt code.  For execution, | 
 | preempts always follow startcores due to the KMSG ordering property.  For | 
 | execution of startcores, startcores always spin until the preempt they follow | 
 | is complete, ensuring the execution of the main part of their handler happens | 
 | after the prior preempt. | 
 |  | 
 | Here's some good ideas for the ordering of locks/irqs/messages: | 
 | - You can't hold a spinlock of any sort and then wait on a routine kernel | 
 |   message.  The core where that runs may be waiting on you, or some scenario | 
 |   like above. | 
 | 	- Similarly, think about how this works with kthreads.  A kthread | 
 | 	  restart is a routine KMSG.  You shouldn't be waiting on code that | 
 | 	  could end up kthreading, mostly because those calls block! | 
| - You can hold a spinlock and wait on an IMMED kmsg, if the waiters of the | 
|   spinlock have irqs enabled while spinning (this is what we used to do with | 
|   the proc lock and IMMED kmsgs, and 54c6008 is an example of doing it wrong) | 
| 	- As a corollary, locks like this cannot be irqsave, since the other | 
| 	  attempted locker will have irqs disabled | 
 | - For broadcast trees, you'd have to send IMMEDs for the intermediates, and | 
 |   then it'd be okay to wait on those intermediate, immediate messages (if we | 
 |   wanted confirmation of the posting of RKM) | 
 | 	- The main thing any broadcast mechanism needs to do is make sure all | 
 | 	  messages get delivered in order to particular pcores (the central | 
 | 	  premise of KMSGs) (and not deadlock due to waiting on a KMSG | 
 | 	  improperly) | 
| - Alternatively, we could use routines for the intermediates if we didn't want | 
|   to wait for RKMs to hit their destination.  We'd need to always use the same | 
|   proxy for the same destination pcore, e.g., core 16 always covers 16-31. | 
 | 	- Otherwise, we couldn't guarantee the ordering of SC before PR before | 
 | 	  another SC (which the proc_lock and proc mgmt code does); we need the | 
 | 	  ordering of intermediate msgs on the message queues of a particular | 
 | 	  core. | 
 | 	- All kmsgs would need to use this broadcasting style (couldn't mix | 
 | 	  regular direct messages with broadcast), so odds are this style would | 
 | 	  be of limited use. | 
 | 	- since we're not waiting on execution of a message, we could use RKMs | 
 | 	  (while holding a spinlock) | 
| - There might be some bad effects with kthreads delaying the reception of RKMs | 
 |   for a while, but probably not catastrophically. | 
 |  | 
 | 9.6: Things That We Don't Handle Nicely | 
 | --------------------------- | 
 | If for some reason a syscall or fault handler blocks *unexpectedly*, we could | 
 | have issues.  Imagine if change_to happens to block in some early syscall code | 
 | (like instrumentation, or who knows what, that blocks in memory allocation). | 
 | When the syscall kthread restarts, its old cur_ctx is gone.  It may or may not | 
 | be running on a core owned by the original process.  If it was, we probably | 
 | would accidentally yield that vcore (clearly a bug).   | 
 |  | 
 | For now, any of these calls that care about cur_ctx/pcpui need to not block | 
 | without some sort of protection.  None of them do, but in the future we might | 
 | do something that causes them to block.  We could deal with it by having a | 
 | pcpu or per-kthread/syscall flag that says if it ever blocked, and possibly | 
 | abort.  We get into similar nasty areas as with preempts, but this time, we | 
 | can't solve it by making preempt a routine KMSG - we block as part of that | 
 | syscall/handler code.  Odds are, we'll just have to outlaw this, now and | 
 | forever.  Just note that if a syscall/handler blocks, the TF it came in on is | 
 | probably not cur_ctx any longer, and that old cur_ctx has probably restarted. | 
 |  | 
 | 10. TBD | 
 | =========================== |