process-internals.txt
Barret Rhoden

This discusses core issues with process design and implementation.  Most of this
info is available in the source in the comments (but may not be in the future).
For now, it's a dumping ground for topics that people ought to understand before
they muck with how processes work.

Contents:
1. Reference Counting
2. When Do We Really Leave "Process Context"?
3. Leaving the Kernel Stack
4. Preemption and Notification Issues
5. current_ctx and owning_proc
6. Locking!
7. TLB Coherency
8. Process Management
9. On the Ordering of Messages
10. TBD

1. Reference Counting
===========================
1.1 Basics:
---------------------------
Reference counts exist to keep a process alive.  We use krefs for this, similar
to Linux's kref:
- Can only incref if the current value is greater than 0, meaning there is
  already a reference to it.  It is a bug to try to incref on something that
  has no references, so always make sure you incref something that you know has
  a reference.  If you don't know, you need to get it from pid2proc (which is a
  careful way of doing this - pid2proc kref_get_not_zero()s on the reference
  that is stored inside it).  If you incref and there are 0 references, the
  kernel will panic.  Fix your bug / don't incref random pointers.
- Can always decref.
- When the decref returns 0, perform some operation.  This does some final
  cleanup on the object.
- Process code is trickier since we frequently make references from 'current'
  (which isn't too bad), but also because we often do not return and need to be
  careful about the references we passed in to a no-return function.
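
To make those rules concrete, here is a minimal sketch of a Linux-style kref
using C11 atomics.  This is not the actual kref API - the struct layout and
names are illustrative only (see kref.txt for the real thing):

#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct kref {
	atomic_int refcnt;
	void (*release)(struct kref *kref);	/* final cleanup, run at 0 */
};

/* Only legal if the caller already holds a reference (count > 0). */
static void kref_get(struct kref *kref)
{
	int old = atomic_fetch_add(&kref->refcnt, 1);

	assert(old > 0);	/* increffing something with no refs is a bug */
}

/* Fails instead of asserting; used when racing with the last decref, e.g. by
 * a pid2proc()-style lookup while holding the lock on the pid hash table. */
static bool kref_get_not_zero(struct kref *kref)
{
	int old = atomic_load(&kref->refcnt);

	while (old > 0)
		if (atomic_compare_exchange_weak(&kref->refcnt, &old, old + 1))
			return true;
	return false;
}

/* Always legal; runs the release method when the last reference drops. */
static void kref_put(struct kref *kref)
{
	if (atomic_fetch_sub(&kref->refcnt, 1) == 1)
		kref->release(kref);
}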

1.2 Brief History of the Refcnt:
---------------------------
Originally, the refcnt was created to keep page tables from being destroyed (in
proc_free()) while cores were still using them, which is what happens during
an ARSC (async remote syscall).  It was then defined to be a count of places in
the kernel that had an interest in the process staying alive, practically just
to protect current/cr3.  This 'interest' actually extends to any code holding a
pointer to the proc, such as one acquired via pid2proc(), which is its current
meaning.

1.3 Quick Aside: The current Macro:
---------------------------
current is a pointer to the proc that is currently loaded/running on any given
core.  It is stored in the per_cpu_info struct, and set/managed by low-level
process code.  It is necessary for the kernel to quickly figure out who is
running on its core, especially when servicing interrupts and traps.  current is
protected by a refcnt.

current does not say which process owns / will-run on a core.  The per-cpu
variable 'owning_proc' covers that.  'owning_proc' should be treated like
'current' (aka, 'cur_proc') when it comes to reference counting.  Like all
refcnts, you can use it, but you can't consume it without atomically either
upping the refcnt or passing the reference (clearing the variable storing the
reference).  Don't pass it to a function that will consume it and not return
without upping it.

1.4 Reference Counting Rules:
---------------------------
+1 for existing.
- The fact that the process is supposed to exist is worth +1.  When it is time
  to die, we decref, and it will eventually be cleaned up.  This existence is
  explicitly kref_put()d in proc_destroy().
- The hash table is a bit tricky.  We need to kref_get_not_zero() when it is
  locked, so we know we aren't racing with proc_free freeing the proc and
  removing it from the list.  After removing it from the hash, we don't need to
  kref_put it, since it was an internal ref.  The kref (i.e. external) isn't for
  being on the hash list, it's for existing.  This separation allows us to
  remove the proc from the hash list in the "release" function.  See kref.txt
  for more details.
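
To make the internal-vs-external distinction concrete, here is a hedged sketch
of a pid2proc()-style lookup.  The hash table, its lock, and the helper names
are stand-ins, not the real code:

/* Assumed globals: a pid -> proc hash protected by 'pid_hash_lock'. */
struct proc *pid2proc(pid_t pid)
{
	struct proc *p;

	spin_lock(&pid_hash_lock);
	p = hashtable_search(pid_hash, pid);	/* placeholder lookup */
	/* The ref stored via the hash is the internal "+1 for existing".
	 * Atomically grab an external ref, or fail if proc_free() already won
	 * the race and the count hit 0. */
	if (p && !kref_get_not_zero(&p->p_kref))
		p = NULL;
	spin_unlock(&pid_hash_lock);
	return p;		/* caller must decref when done */
}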

+1 for someone using it or planning to use it.
- This includes simply having a pointer to the proc, since presumably you will
  use it.  pid2proc() will incref for you.  When you are done, decref.
- Functions that create a process and return a pointer (like proc_create() or
  kfs_proc_create()) will also up the refcnt for you.  Decref when you're done.
- If the *proc is stored somewhere where it will be used again, such as in an IO
  continuation, it needs to be refcnt'd.  Note that if you already had a
  reference from pid2proc(), simply don't decref after you store the pointer.

+1 for current.
- current counts as someone using it (expressing interest in the core), but is
  also a source of the pointer, so it's a bit different.  Note that all krefs
  are sources of a pointer.  When we are running on a core that has current
  loaded, the ref is both for its usage as well as for being the current
  process.
- You have a reference from current and can use it without refcnting, but
  anything that needs to eat a reference or store/use it needs an incref first.
  To be clear, your reference is *NOT* edible.  It protects the cr3, guarantees
  the process won't die, and serves as a bootstrappable reference.
- Specifically, if you get a ref from current, but then save it somewhere (like
  an IO continuation request), then clearly you must incref, since it's both
  current and stored/used.
- If you don't know what might be downstream from your function, then incref
  before passing the reference, and decref when it returns.  We used to do this
  for all syscalls, but now only do it for calls that might not return and
  expect to receive a reference (like proc_yield).

All functions that take a *proc have a refcnt'd reference, though it may not be
edible (it could be current).  It is the caller's responsibility to make sure
it's edible if necessary.  It is the callee's responsibility to incref if it
stores or makes a copy of the reference.
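
As a concrete illustration of the "not edible" rule, consider stashing the proc
for an IO continuation.  The struct here is hypothetical; only the refcounting
pattern is the point:

/* Hypothetical continuation struct - not a real structure in the tree. */
struct io_cont {
	struct proc *p;
	/* ... request state ... */
};

static void start_io(struct io_cont *cont)
{
	/* 'current' gives us a usable pointer, but that ref belongs to the
	 * core, not to us: take our own ref before storing the pointer. */
	proc_incref(current, 1);
	cont->p = current;
	/* ... submit the request.  The completion path uses cont->p and then
	 * proc_decref()s it exactly once. */
}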

1.5 Functions That Don't or Might Not Return:
---------------------------
Refcnting and especially decreffing gets tricky when there are functions that
MAY not return.  proc_restartcore() does not return (it pops into userspace).
proc_run() used to not return, if the core it was called on would pop into
userspace (if it was a _S, or if the core is part of the vcoremap for a _M).
This doesn't happen anymore, since we have cur_ctx in the per-cpu info.

Functions that MAY not return will "eat" your reference *IF* they do not return.
This means that you must have a reference when you call them (like always), and
that reference will be consumed / decref'd for you if the function doesn't
return.  Or something similarly appropriate.

Arguably, for functions that MAY not return, but will always be called with
current's reference (proc_yield()), we could get away without giving it an
edible reference, and then never eating the ref.  Yield needs to be reworked
anyway, so it's not a big deal yet.

We do this because when the function does not return, you will not have the
chance to decref (your decref code will never run).  We need the reference when
going in to keep the object alive (like with any other refcnt).  We can't have
the function always eat the reference, since you cannot simply re-incref the
pointer (not allowed to incref unless you know you had a good reference).  You'd
have to do something like p = pid2proc(p_pid);  It's clunky to do that, easy to
screw up, and semantically, if the function returns, then we may still have an
interest in p and should decref later.

The downside is that functions need to determine if they will return or not,
which can be a pain (for an out-of-date example: a linear time search when
running an _M, for instance, which can suck if we are trying to use a
broadcast/logical IPI).

As the caller, you usually won't know if the function will return or not, so you
need to provide a consumable reference.  Current doesn't count.  For example,
proc_run() requires a reference.  You can proc_run(p), and use p afterwards, and
later decref.  You need to make sure you have a reference, so things like
proc_run(pid2proc(55)) work, since pid2proc() increfs for you.  But you cannot
proc_run(current), unless you incref current in advance.  Incidentally,
proc_running current doesn't make a lot of sense.
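
A hedged example of the calling convention just described, using the function
names from this document (error handling elided):

static void run_pid_55(void)
{
	/* Fine: pid2proc() took a ref for us.  proc_run() would only eat it
	 * if it did not return; since it returns these days, decref after. */
	struct proc *p = pid2proc(55);

	if (!p)
		return;
	proc_run(p);
	proc_decref(p);
}

static void run_current(void)
{
	/* Rarely sensible, but if you must: current is not an edible ref, so
	 * take one that the callee is allowed to consume (and decref it again
	 * since proc_run() now returns without eating it). */
	proc_incref(current, 1);
	proc_run(current);
	proc_decref(current);
}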

1.6 Runnable List:
---------------------------
Procs on the runnable list need to have a refcnt (other than the +1 for
existing).  It's something that cares that the process exists.  We could have
had it implicitly be refcnt'd (the fact that it's on the list is enough, sort of
as if it was part of the +1 for existing), but that complicates things.  For
instance, it is a source of a reference (for the scheduler) and you could not
proc_run() a process from the runnable list without worrying about increfing it
beforehand.  This isn't true anymore, but the runnable lists are getting
overhauled anyway.  We'll see what works nicely.

1.7 Internal Details for Specific Functions:
---------------------------
proc_run()/__proc_give_cores(): makes sure enough refcnts are in place for all
places that will install owning_proc.  This also makes it easier on the system
(one big incref(n), instead of n increfs of (1) from multiple cores).

__set_proc_current() is a helper that makes sure p is the cur_proc.  It will
incref if installing a new reference to p.  If it removed an old proc, it will
decref.

__proc_startcore(): assumes all references to p are sorted.  It will not
return, and you should not pass it a reference you need to decref().  Passing
it 'owning_proc' works, since you don't want to decref owning_proc.

proc_destroy(): it used to not return, and back then if your reference was
from 'current', you needed to incref.  Now that proc_destroy() returns, it
isn't a big deal.  Just keep in mind that if you have a function that doesn't
return, there's no way for the function to know if its passed reference is
edible.  Even if p == current, proc_destroy() can't tell if you sent it p (and
had a reference) or current and didn't.

proc_yield(): when this doesn't return, it eats your reference.  It will also
decref twice.  Once when it clears owning_proc, and again when it calls
abandon_core() (which clears cur_proc).

abandon_core(): it was not given a reference, so it doesn't eat one.  It will
decref when it unloads the cr3.  Note that this is a potential performance
issue.  When preempting or killing, there are n cores that are fighting for the
cacheline to decref.  An alternative would be to have one core decref for all n
cores, after it knows all cores unloaded the cr3.  This would be a good use of
the checklist (possibly with one cacheline per core).  It would take a large
amount of memory for better scalability.

1.8 Things I Could Have Done But Didn't And Why:
---------------------------
Q: Could we have the first reference (existence) mean it could be on the runnable
list or otherwise in the proc system (but not other subsystems)?  In this case,
proc_run() won't need to eat a reference at all - it will just incref for every
current it will set up.

New A: Maybe, now that proc_run() returns.

Old A: No: if you pid2proc(), then proc_run() but never return, you have (and
lose) an extra reference.  We need proc_run() to eat the reference when it
does not return.  If you decref between pid2proc() and proc_run(), there's a
(rare) race where the refcnt hits 0 by someone else trying to kill it.  While
proc_run() will check to see if someone else is trying to kill it, there's a
slight chance that the struct will be reused and recreated.  It'll probably
never happen, but it could, and out of principle we shouldn't be referencing
memory after it's been deallocated.  Avoiding races like this is one of the
reasons for our refcnt discipline.

Q: (Moot) Could proc_run() always eat your reference, which would make it
easier for its implementation?

A: Yeah, technically, but it'd be a pain, as mentioned above.  You'd need to
reacquire a reference via pid2proc(), and it is rather easy to mess up.

Q: (Moot) Could we have made proc_destroy() take a flag, saying whether or not
it was called on current and needed a decref instead of wasting an incref?

A: We could, but won't.  This is one case where the external caller is the one
that knows whether the function needs to decref or not.  But it breaks the
convention a bit, doesn't mirror proc_create() as well, and we need to pull in
the cacheline with the refcnt anyways.  So for now, no.

Q: (Moot) Could we make __proc_give_cores() simply not return if an IPI is
coming?

A: I did this originally, and manually unlocked and __wait_for_ipi()d.  Though
we'd then need to deal with it like that for all of the related functions, which
doesn't work if you wanted to do something afterwards (like schedule(p)).  Also
these functions are meant to be internal helpers, so returning the bool makes
more sense.  It eventually led to having __proc_unlock_ipi_pending(), which made
proc_destroy() much cleaner and helped with a general model of dealing with
these issues.  Win-win.
2. When Do We Really Leave "Process Context"?
===========================
2.1 Overview
---------------------------
First off, it's not really "process context" in the way Linux deals with it.  We
aren't operating in kernel mode on behalf of the process (always).  We are
specifically talking about when a process's cr3 is loaded on a core.  Usually,
current is also set (the exception for now is when processing ARSCs).

There are a couple different ways to do this.  One is to never unload a context
until something new is being run there (handled solely in __proc_startcore()).
Another way is to always explicitly leave the core, like by abandon_core()ing.

The issue with the former is that you could have contexts sitting around for a
while, and also would have a bit of extra latency when __proc_free()ing during
someone *else's* __proc_startcore() (though that could be avoided if it becomes
a real issue, via some form of reaping).  You'll also probably have excessive
decrefs (based on the interactions between proc_run() and __startcore()).

The issue with the latter is excessive TLB shootdowns and corner cases.  There
could be some weird cases (in core_request() for example) where the core you are
running on has the context loaded for proc A on a mgmt core, but decides to give
it to proc B.

If no process is running there, current == 0 and boot_cr3 is loaded, meaning no
process's context is loaded.

All changes to cur_proc, owning_proc, and cur_ctx need to be done with
interrupts disabled, since they change in interrupt handlers.
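
For example, code installing a new owning proc on a core would bracket the
pcpui updates with interrupt disabling, roughly like this sketch (helper and
field names follow the per_cpu_info style used in this document):

static void install_owning_proc(struct proc *p, struct user_context *ctx)
{
	struct per_cpu_info *pcpui = &per_cpu_info[core_id()];
	int8_t irq_state = 0;

	disable_irqsave(&irq_state);	/* IRQ handlers also touch these */
	/* whoever handed us this work already incref'd for owning_proc */
	pcpui->owning_proc = p;
	pcpui->cur_ctx = ctx;
	enable_irqsave(&irq_state);
}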

2.2 Here's how it is done now:
---------------------------
All code is capable of 'spamming' cur_proc (with interrupts disabled!).  If it
is 0, feel free to set it to whatever process you want.  All code that
requires current to be set will do so (like __proc_startcore()).  The
smp_idle() path will make sure current is clear when it halts.  So long as you
don't change other concurrent code's expectations, you're okay.  What I mean
by that is you don't clear cur_proc while in an interrupt handler.  But if it
is already 0, __startcore is allowed to set it to its future proc (which is
an optimization).  Other code didn't have any expectations of it (it was 0).
Likewise, kthread code when we sleep_on() doesn't have to keep cur_proc set.
A kthread is somewhat an isolated block (codewise), and leaving current set
when it is done is solely to avoid a TLB flush (at the cost of an incref).

In general, we try to proactively leave process context, but have the ability
to stay in context til __proc_startcore() to handle the corner cases (and to
maybe cut down the TLB flushes later).  To stop proactively leaving, just
change abandon_core() to not do anything with current/cr3.  You'll see weird
things like processes that won't die until their old cores are reused.  The
reason we proactively leave context is to help with sanity for these issues,
and also to avoid decrefs in __startcore().

A couple other details: __startcore() sorts the extra increfs, and
__proc_startcore() sorts leaving the old context.  Anytime a __startcore kernel
message is sent, the sender increfs in advance for the owning_proc refcnt.  As
an optimization, we can also incref to *attempt* to set current.  If current
was 0, we set it.  If it was already something else, we failed and need to
decref.  __proc_startcore(), which is the last moment before we *must* have the
cr3/current issues sorted, does the actual check for whether there was an old
process there or not, while it handles the lcr3 (if necessary).  In general,
lcr3's ought to have refcnts near them, or else comments explaining why not.
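
A sketch of that attempt-to-set-current optimization.  Assume the sender
already incref'd once for owning_proc and once for this attempt, and that
interrupts are disabled; the code is illustrative, not the real __startcore:

static void try_set_current(struct per_cpu_info *pcpui, struct proc *p)
{
	if (!pcpui->cur_proc) {
		/* Nothing was loaded here: claim the speculative ref for
		 * cur_proc, so __proc_startcore() may skip an incref (and
		 * possibly the lcr3, if cr3 already matches). */
		pcpui->cur_proc = p;
	} else {
		/* Some other proc's context is still loaded; drop the
		 * speculative ref and let __proc_startcore() sort the cr3. */
		proc_decref(p);
	}
}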

So we leave process context when told to do so (__death/abandon_core()) or if
another process is run there.  The _M code is such that a proc will stay on its
core until it receives a message, and that message would cleanup/restore a
generic context (boot_cr3).  A _S could stay on its core until another _S came
in.  This is much simpler for cases when a timer interrupt goes off to force a
schedule() decision.  It also avoids a TLB flush in case the scheduler picked
that same proc to run again.  This could also happen to an _M, if for some
reason it was given a management core (!!!) or some other event happened that
caused some management/scheduling function to run on one of its cores (perhaps
it asked).

proc_yield() abandons the core / leaves context.

2.3 Other issues:
---------------------------
Note that dealing with interrupting processes that are in the kernel is tricky.
There is no true process context, so we can't leave a core until the kernel is
in a "safe place", i.e. its state is bundled enough that it can be recontinued
later.  Calls of this type are routine kernel messages, executed at a convenient
time (specifically, before we return to userspace in proc_restartcore()).

This same thing applies to __death messages.  Even though a process is dying, it
doesn't mean we can just drop whatever the kernel was doing on its behalf.  For
instance, it might be holding a reference that will never get decreffed if its
stack gets dropped.

3. Leaving the Kernel Stack:
===========================
Just because a message comes in saying to kill a process, it does not mean we
should immediately abandon_core().  The problem is more obvious when there is
a preempt message, instead of a death message, but either way there is state
that needs to be cleaned up (refcnts that need to be downed, etc).

The solution to this is rather simple: don't abandon right away.  That was
always somewhat the plan for preemption, but was never done for death.  And
there are several other cases to worry about too.  To enforce this, we expand
the old "active messages" into a generic work execution message (a kernel
message) that can be delayed or shipped to another core.  These types of
messages will not be executed immediately on the receiving pcore - instead they
are on the queue for "when there's nothing else to do in the kernel", which is
checked in smp_idle() and before returning to userspace in proc_restartcore().
Additionally, these kernel messages can also be queued on an alarm queue,
delaying their activation as part of a generic kernel alarm facility.

One subtlety is that __proc_startcore() shouldn't check for messages, since it
is called by __startcore (a message).  Checking there would run the messages out
of order, which is exactly what we are trying to avoid (total chaos).  No one
should call __proc_startcore, other than proc_restartcore() or __startcore().
If we ever have functions that do so, if they are not called from a message,
they must check for outstanding messages.

This last subtlety is why we needed to change proc_run()'s _S case to use a
local message instead of calling __proc_startcore() directly (and why no one
else should ever call __proc_startcore()).  We could unlock, thereby freeing
another core to change the proc state and send a message to us, then try to
__proc_startcore(), and then read the message before we had installed current
or had a userspace TF to preempt, and probably a few other things.  Treating _S
as a local message is cleaner, begs to be merged in the code with _M's code, and
uses the messaging infrastructure to avoid all the races that it was created to
handle.

Incidentally, we don't need to worry about missing messages while trying to pop
back to userspace from __proc_startcore, since an IPI will be on the way
(possibly a self-ipi caused by the __kernel_message() handler).  This is also
why we needed to make process_routine_kmsg() keep interrupts disabled when it
stops (there's a race between checking the queue and disabling ints).
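
A hedged sketch of the "drain routine messages before leaving the kernel"
rule, roughly what a proc_restartcore()-style path does (heavily simplified;
names as used in this document):

void proc_restartcore(void)
{
	struct per_cpu_info *pcpui = &per_cpu_info[core_id()];

	/* Run delayed work first: a '__death' or '__preempt' queued behind us
	 * must execute before we commit to popping into userspace. */
	disable_irq();			/* closes the check-queue vs. IPI race */
	process_routine_kmsg();
	/* By here, any queued fate changes are reflected in owning_proc and
	 * cur_ctx (or the messages diverted us and we never got here). */
	assert(pcpui->owning_proc && pcpui->cur_ctx);
	__proc_startcore(pcpui->owning_proc, pcpui->cur_ctx);	/* no return */
}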

4. Preemption and Notification Issues:
===========================
4.1: Message Ordering and Local Calls:
---------------------------
Since we go with the model of cores being told what to do, there are issues
with messages being received in the wrong order.  That is why we have the
kernel messages (guaranteed, in-order delivery), with the proc-lock protecting
the send order.  However, this is not enough for some rare races.

Local calls can also perform the same tasks as messages (calling
proc_destroy() while a death IPI is on its way).  We refer to these calls as
messing with "local fate", as opposed to global state (we're clever -
preempting a single vcore doesn't change the process's state).  These calls
are a little different, because they also involve a check to see if it should
perform the function or other action (e.g., death just idling and waiting for
an IPI instead of trying to kill itself), instead of just blindly doing
something.

4.1.1: Possible Solutions
----------------
There are two ways to deal with this.  One (and the better one, I think) is to
check state, and determine if it should proceed or abort.  This requires that
all local-fate dependent calls always have enough state to do their job.  In
the past, this meant that any function that results in sending a directive to
a vcore stores enough info in the proc struct that a local call can determine
if it should take action or abort.  In the past, we used the vcore/pcoremap as
a way to send info to the receiver about what vcore they are (or should be).
Now, we store that info in pcpui (for '__startcore', we send it as a
parameter).  Either way, the general idea is still true: local calls can
proceed when they are called, and not self-ipi'd to a nebulous later time.

The other way is to send the work (including the checks) in a self-ipi kernel
message.  This will guarantee that the message is executed after any existing
messages (making the k_msg queue the authority for what should happen to a
core).  The check is also performed later (when the k_msg executes).  There
are a couple issues with this: if we allow the local core to send itself a
k_msg that could be out of order (meaning it should not be sent, and is only
sent due to ignorance of its sealed fate), AND if we return the core to the
idle-core-list once its fate is sealed, we need to detect that the message is
for the wrong process and that the process is in the wrong state.  To do this,
we probably need local versioning on the pcore so it can detect that the
message is late/wrong.  We might get by with just the proc* (though that is
tricky with death and proc reuse), so long as we don't allow new startcores
for a proc until AFTER the preemption is completed.

4.2: Preempt-Served Flag
----------------
We want to be able to consider a pcore free once its owning proc has dealt
with removing it.  This allows a scheduler-like function to easily take a core
and then give it to someone else, without waiting for each vcore to respond,
saying that the pcore is free/idle.

We used to not unmap until we were in '__preempt' or '__death', and we needed
a flag to tell yield-like calls that a message was already on the way and to
not rely on the vcoremap.  This is pretty fucked up for a number of reasons,
so we changed that.  But we still wanted to know when a preempt was in
progress so that the kernel could avoid giving out the vcore until the preempt
was complete.

Here's the scenario: we send a '__startcore' to core 3 for VC5->PC3.  Then we
quickly send a '__preempt' to 3, and then a '__startcore' to core 4 (a
different pcore) for VC5->PC4.  Imagine all of this happens before the first
'__startcore' gets processed (IRQ delay, fast ksched, whatever).  We need to
not run the second '__startcore' on pcore 4 before the preemption has saved
all of the state of VC5.  So we spin on preempt_served (which may get
renamed to preempt_in_progress).  We need to do this in the sender, and not
the receiver (not in the kmsg), because the kmsgs can't tell which one they
are.  Specifically, the first '__startcore' on core 3 runs the same code as
the '__startcore' on core 4, working on the same vcore.  Anything we tell VC5
will be seen by both PC3 and PC4.  We'd end up deadlocking on PC3 while it
spins waiting for the preempt message that also needs to run on PC3.

The preempt_pending flag is actually a timestamp: the time at which the core
expires and the preempt message will be sent.  We could try to use that, but
since alarms aren't fired at exactly the time they are scheduled, the message
might not actually be sent yet (though it will, really soon).  Still, we'll
just go with the preempt-served flag for now.
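
In the sender (e.g. ksched/proc code about to give VC5 out again), the wait
looks roughly like this sketch.  The field location and names are approximate,
and as noted, the flag may get renamed:

/* Don't send a new '__startcore' for this vcore until the pcore that got the
 * '__preempt' has finished saving the vcore's state. */
static void wait_for_preempt_done(struct proc *p, uint32_t vcoreid)
{
	struct vcore *vc = &p->procinfo->vcoremap[vcoreid];

	while (vc->preempt_served)
		cpu_relax();	/* '__preempt' on the old pcore clears it */
}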

4.3: Impending Notifications
----------------
It's also possible that there is an impending notification.  There's no change
in fate (though there could be a fate-changing preempt on its way), just the
user wants a notification handler to run.  We need a flag anyways for this
(discussed below), so proc_yield() or whatever other local call we have can
check this flag as well.

Though for proc_yield(), it doesn't care if a notification is on its way (can
be dependent on a flag to yield from userspace, based on the nature of the
yield (which still needs to be sorted)).  If the yield is in response to a
preempt_pending, it actually should yield and not receive the notification.
So it should destroy its vcoreid->pcoreid mapping and abandon_core().  When
that notification hits, it will be for a proc that isn't current, and will be
ignored (it will get run the next time that vcore fires up, handled below).

There is a slight chance that the same proc will run on that pcore, but with a
different vcoreid.  In the off chance this happens, the new vcore will get a
spurious notification.  Userspace needs to be able to handle spurious
notifications anyways (there are a couple other cases, and in general it's
not hard to do), so this is not a problem.  Instead of trying to have the
kernel ignore the notification, we just send a spurious one.  A crappy
alternative would be to send the vcoreid with the notification, but that would
mean we can't send a generic message (broadcast) to a bunch of cores, which
will probably be a problem later.

Note that this specific case is because the "local work message" gets
processed out of order with respect to the notification.  And we want this in
that case, since that proc_yield() is more important than the notification.
4.4: Preemption / Allocation Phases and Alarm Delays
---------------------------
A per-vcore preemption phase starts when the kernel marks the core's
preempt_pending flag/counter and can include the time when an alarm is
waiting to go off to reclaim the core.  The phase ends when the vcore's pcore
is reclaimed, either as a result of the kernel taking control, or because a
process voluntarily yielded.

Specifically, the preempt_pending variable is actually a timestamp for when
the core will be revoked (this assumes some form of global time, which we need
anyways).  If its value is 0, then there is no preempt-pending, it is not in a
phase, and the vcore can be given out again.

When a preempt alarm goes off, the alarm only means to check a process for
expired vcores.  If the vcore has been yielded while the alarm was pending,
the preempt_pending flag will be reset to 0.  To speed up the search for
vcores to preempt, there's a circular buffer corelist in the proc struct, with
vcoreids of potential suspects.  Or at least this will exist at some point.
Also note that the preemption list isn't bound to a specific alarm: you can
check the list at any time (not necessarily on a specific alarm), and you can
have spurious alarms (the list is empty, so it'll be a noop).
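
So the alarm handler is just a prod to go check, roughly along these lines.
The suspect list is the "will exist at some point" buffer mentioned above, so
the helpers and field locations here are hypothetical/approximate:

/* Fires at-or-after the deadline; it only checks, so spurious or late alarms
 * are harmless no-ops. */
static void preempt_alarm_handler(struct proc *p, uint64_t now)
{
	uint32_t vcoreid;

	spin_lock(&p->proc_lock);
	while (preempt_suspect_pop(p, &vcoreid)) {	/* hypothetical helper */
		uint64_t deadline =
			p->procdata->vcore_preempt_data[vcoreid].preempt_pending;

		/* 0 means the vcore yielded (or was handled) while we slept */
		if (deadline && deadline <= now)
			__proc_preempt_core(p, get_pcoreid(p, vcoreid));
	}
	spin_unlock(&p->proc_lock);
}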

Likewise, a global preemption phase is when an entire MCP is getting
gang_preempted, and the global deadline is set.  A function can quickly check
to see if the process responded, since the list of vcores with preemptions
pending will be empty.

It seems obvious, but we do not allow allocation of a vcore during its
preemption phase.  The main reason is that it can potentially break
assumptions about the vcore->pcore mapping and can result in multiple
instances of the same vcore on different pcores.  Imagine a preempt message
sent to a pcore (after the alarm goes off), meanwhile that vcore/pcore yields
and the vcore reactivates somewhere else.  There is a potential race on the
vcore_ctx state: the new vcore is reading while the old is writing.  This
issue is sorted naturally: the vcore entry in the vcoremap isn't cleared until
the vcore/pcore is actually yielded/taken away, so the code looking for a free
vcoreid slot will not try to use it.

Note that if we didn't design the alarm system to simply check for
preemptions (perhaps it has a stored list of vcores to preempt), then we
couldn't end the preempt-phase until the alarm was sorted.  If that is the
case, we could easily give out a vcore that had been yielded but was still in
a preempt-phase.  Stopping an alarm would be tricky too, since there could be
lots of vcores in different states that need to be sorted by the alarm (so
ripping it out isn't enough).  Setting a flag might not be enough either.
Vcore version numbers/histories (as well as global proc histories) are a pain
I'd like to avoid too.  So don't change the alarm / delayed preemption system
without thinking about this.

Allowing a vcore to restart while preemptions are pending also mucks
with keeping the vcore mapping "old" (while the message is in flight).  A
pcore will want to use that to determine which vcore is running on it.  It
would be possible to keep a pcoremap for the reverse mapping out of sync, but
that seems like a bad idea.  In general, having the pcoremap is a good idea
(whenever we talk about a vcoremap, we're usually talking about both
directions: "the vcore->pcore mapping").

4.5: Global Preemption Flags
---------------------------
If we are trying to preempt an entire process at the same time, instead of
playing with the circular buffer of vcores pending preemption, we could have a
global timer as well.  This avoids some O(n) operations, though it means that
userspace needs to check two "flags" (expiration dates) when grabbing its
preempt-critical locks.

4.6: Notifications Mixed with Preemption and Sleeping
---------------------------
It is possible that notifications will mix with preemptions or come while a
process is not running.  Ultimately, the process wants to be notified on a
given vcore.  Whenever we send an active notification, we set a flag in procdata
(notif_pending).  If the vcore is offline, we don't bother sending the IPI/notif
message.  The kernel will make sure it runs the notification handler (as well as
restoring the vcore_ctx) the next time that vcore is restarted.  Note that
userspace can toggle this, so it can handle the notifications from a different
core if it likes, or it can independently send a notification.

Note we use notif_pending to detect if an IPI was missed while notifs were
disabled (this is done in pop_user_ctx() by userspace).  The overall meaning
of notif_pending is that a vcore wants to be IPI'd.  The IPI could be
in-flight, or it could be missed.  Since notification IPIs can be spurious,
when we have potential races, we err on the side of sending.  This happens
when pop_user_ctx() notifies itself, and when the kernel makes sure to start a
vcore in vcore context if a notif was pending.  This was simplified a bit over
the years by having uthreads always get saved into the uthread_ctx (formerly
the notif_tf), instead of in the old preempt_tf (which is now the vcore_ctx).

If a vcore has a preempt_pending, we will still send the active notification
(IPI).  The core ought to get a notification for the preemption anyway, so we
need to be able to send one.  Additionally, once the vcore is handling that
preemption notification, it will have notifs disabled, which will prevent us
from sending any extra notifications anyways.
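
A minimal sketch of the send side of an active notification, per the rules
above.  The field names follow procdata's preempt/notif structures, but treat
the details as approximate:

static void send_notification(struct proc *p, uint32_t vcoreid)
{
	struct preempt_data *vcpd = &p->procdata->vcore_preempt_data[vcoreid];

	/* Always set the flag; the vcore will see it whenever it next runs. */
	vcpd->notif_pending = TRUE;
	wmb();	/* flag must be globally visible before we check the map */

	/* Only bother with the IPI/kmsg if the vcore is online with notifs
	 * enabled.  Spurious sends are fine; missed ones are not. */
	if (vcore_is_mapped(p, vcoreid) && !vcpd->notif_disabled)
		send_kernel_message(get_pcoreid(p, vcoreid), __notify, (long)p,
		                    0, 0, KMSG_ROUTINE);
}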

4.7: Notifs While a Preempt Message is Served
---------------------------
It is possible to have the kernel handling a notification k_msg and to have a
preempt k_msg in the queue (preempt-served flag is set).  Ultimately, what we
want is for the core to be preempted and the notification handler to run on
the next execution.  Both messages are in the k_msg queue for "a convenient
time to leave the kernel" (I'll have a better name for that later).  What we
do is execute the notification handler and jump to userspace.  Since there is
still a k_msg in the queue (and we self-ipi'd ourselves; it's part of how
k_msgs work), the IPI will fire and push us right back into the kernel to
execute the preemption, and the notif handler's context will be saved in the
vcore_ctx (ready to go when the vcore gets started again).

We could try to just leave the notif_pending flag set and ignore the message,
but that would involve inspecting the queue for the preempt k_msg.
Additionally, a preempt k_msg can arrive anyway.  Finally, it's possible to have
another message in the queue between the notif and the preempt, and it gets ugly
quickly trying to determine what to do.

4.8: When a Pcore is "Free"
---------------------------
There are a couple ways to handle pcores.  One approach would be to not
consider them free and able to be given to another process until the old
process is completely removed (abandon_core()).  Another approach is to free
the core once its fate is sealed (which we do).  This probably gives more
flexibility in schedule()-like functions (no need to wait to give the core
out), quicker dispatch latencies, less contention on shared structs (like the
idle-core-map), etc.

This 'freeing' of the pcore is from the perspective of the kernel scheduler
and the proc struct.  Contrary to all previous announcements, vcores are
unmapped from pcores when sending k_msgs (technically right after), while
holding the lock.  The pcore isn't actually not-running-the-proc until the
kmsg completes and we abandon_core().  Previously, we used the vcoremap to
communicate to other cores in a lock-free manner, but that was pretty shitty
and now we just store the vcoreid in pcpu info.

Another tricky part is the seq_ctr used to signal userspace of changes to the
coremap or num_vcores (coremap_seqctr).  While we may not even need this in the
long run, it still seems like it could be useful.  The trickiness comes from
how to update the seq_ctr when we are unmapping vcores on the receive side of a
message (like __death or __preempt).  We'd rather not have each pcore contend on
the seq_ctr cache line (let alone any locking) while they perform a somewhat
data-parallel task.  So we continue to have the sending core handle the seq_ctr
upping and downing.  This works, since the "unlocking" happens after messages
are sent, which means the receiving core is no longer in userspace (if there is
a delay, it is because the remote core is in the kernel, possibly with
interrupts disabled).  Because of this, userspace will be unable to read the new
value of the seq_ctr before the IPI hits and does the unmapping that the seq_ctr
protects/advertises.  This is most likely true.  It wouldn't be if the "last IPI
was sent" flag clears before the IPI actually hit the other core.
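
The sender-side bracketing looks roughly like this sketch (simplified; the
real revocation path does more bookkeeping, and the helper names are
approximate):

static void revoke_pcores(struct proc *p, uint32_t *pcores, size_t num)
{
	__seq_start_write(&p->procinfo->coremap_seqctr);
	for (size_t i = 0; i < num; i++) {
		send_kernel_message(pcores[i], __preempt, (long)p, 0, 0,
		                    KMSG_ROUTINE);
		/* unmapped right after sending, while holding the lock */
		__unmap_vcore(p, get_vcoreid(p, pcores[i]));
	}
	p->procinfo->num_vcores -= num;
	__seq_end_write(&p->procinfo->coremap_seqctr);
	/* By the time userspace could read the new seq_ctr, the kmsg/IPI has
	 * already pulled the remote core out of userspace. */
}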

4.9: Future Broadcast/Messaging Needs
---------------------------
Currently, messaging is serialized.  Broadcast IPIs exist, but the kernel
message system is based on adding a k_msg to a list in a pcore's
per_cpu_info.  Further, the sending of these messages is in a loop.  In the
future, we would like to have broadcast messaging of some sort (literally a
broadcast, like the IPIs, and if not that, then a communication tree of
sorts).

In the past, (OLD INFO): given those desires, we wanted to make sure that no
message we send needs details specific to a pcore (such as the vcoreid running
on it, a history number, or anything like that).  Thus no k_msg related to
process management would have anything that cannot apply to the entire
process.  At this point, most just have a struct proc *.  A pcore was able
to figure out what is happening based on the pcoremap, information in the
struct proc, and in the preempt struct in procdata.

In more recent revisions, the coremap no longer is meant to be used across
kmsgs, so some messages ('__startcore') send the vcoreid.  This means we can't
easily broadcast the message.  However, many broadcast mechanisms wouldn't
handle '__startcore' naturally.  For instance, logical IPIs need something
already set in the LAPIC, or maybe need to be sent to a somewhat predetermined
group (again, bits in the LAPIC).  If we tried this for '__startcore', we
could add something into the messaging to carry these vcoreids.  More likely,
we'll have a broadcast tree.  Keeping vcoreid (or any arg) next to whoever
needs to receive the message is a very small amount of bookkeeping on a struct
that already does bookkeeping.

4.10: Other Things We Thought of but Don't Like
---------------------------
All local fate-related work is sent as a self k_msg, to enforce ordering.
It doesn't capture the difference between a local call and a remote k_msg.
The k_msg has already considered state and made its decision.  The local call
is an attempt.  It is also unnecessary, if we put in enough information to
make a decision in the proc struct.  Finally, it caused a few other problems
(like needing to detect arbitrary stale messages).

Overall message history: doesn't work well when you do per-core stuff, since
it will invalidate other messages for the process.  We then thought of a pcore
history counter to detect stale messages.  Don't like that either.  We'd have
to send the history in the message, since it's a per-message, per-core
expiration.  There might be other ways around this, but this doesn't seem
necessary.

Alarms have pointers to a list of which cores should be preempted when that
specific alarm goes off (saved with the alarm).  Ugh.  It gets ugly with
multiple outstanding preemptions and cores getting yielded while the alarms
sleep (and possibly could get reallocated, though we'd make a rule to prevent
that).  Like with notifications, being able to handle spurious alarms and
thinking of an alarm as just a prod to check somewhere is much more flexible
and simple.  It is similar to generic messages that have the actual important
information stored somewhere else (as with allowing broadcasts, with different
receivers performing slightly different operations).

Synchrony for messages (wanting a response to a preempt k_msg, for example)
sucks.  Just encode the state of impending fate in the proc struct, where it
belongs.  Additionally, we don't want to hold the proc lock even longer than
we do now (which is probably too long as it is).  Finally, it breaks a golden
rule: never wait while holding a lock: you will deadlock the system (e.g. if
the receiver is already in the kernel spinning on the lock).  We'd have to
send messages, unlock (which might cause a message to hit the calling pcore,
as in the case of locally called proc_destroy()), and in the meantime some
useful invariant might be broken.

We also considered using the transition stack as a signal that a process is in
a notification handler.  The kernel can inspect the stack pointer to determine
this.  It's possible, but unnecessary.

Using the pcoremap as a way to pass info with kmsgs: it worked a little, but
had some serious problems, as well as making life difficult.  It had two
purposes: help with local fate calls (yield) and allow broadcast messaging.
The main issue is that it was using a global struct to pass info with
messages, but it was based on the snapshot of state at the time the message
was sent.  When you send a bunch of messages, certain state may have changed
between messages, and the old snapshot isn't there anymore by the time the
message gets there.  To avoid this, we went through some hoops and had some
fragile code that would use other signals to avoid those scenarios where the
global state change would send the wrong message.  It was tough to understand,
and not clear it was correct (hint: it wasn't).  Here's an example (on one
pcore): if we send a preempt and we then try to map that pcore to another
vcore in the same process before the preempt call checks its pcoremap, we'll
clobber the pcore->vcore mapping (used by startcore) and the preempt will
think it is the new vcore, not the one it was when the message was sent.
While this is a bit convoluted, I can imagine a ksched doing this, and
perhaps with weird IRQ delays, the messages might get delayed enough for it to
happen.  I'd rather not have the ksched worry about this just because
proc code was old and ghetto.  Another reason we changed all of this was so
that you could trust the vcoremap while holding the lock.  Otherwise, it's
actually non-trivial to know the state of a vcore (need to check a combination
of preempt_served and is_mapped), and even if you do that, there are some
complications with doing this in the ksched.

5. current_ctx and owning_proc
===========================
Originally, current_tf was a per-core macro that returned a struct trapframe *
pointing back on the kernel stack to the user context that was running on
the given core when an interrupt or trap happened.  Saving the reference to
the TF helps simplify code that needs to do something with the TF (like save
it and pop another TF).  This way, we don't need to pass the context all over
the place, especially through code that might not care.

Then, current_tf was more broadly defined as the user context that should be
run when the kernel is ready to run a process.  In the older case, it was when
the kernel tries to return to userspace from a trap/interrupt.  current_tf
could be set by an IPI/KMSG (like '__startcore') so that when the kernel wants
to idle, it will find a current_tf that it needs to run, even though we never
trapped in on that context in the first place.

Finally, current_tf was changed to current_ctx, and instead of tracking a
struct trapframe (equivalent to a hw_trapframe), it now tracked a struct
user_context, which could be either a HW or a SW trapframe.

Further, we now have 'owning_proc', which tells the kernel which process
should be run.  'owning_proc' is a bigger deal than 'current_ctx', and it is
what tells us to run cur_ctx.

Process management KMSGs now simply modify 'owning_proc' and cur_ctx, as if we
had interrupted a process.  Instead of '__startcore' forcing the kernel to
actually run the process and trapframe, it will just mean we will eventually
run it.  In the meantime a '__notify' or a '__preempt' can come in, and they
will apply to the owning_proc/cur_ctx.  This greatly simplifies process code
and code calling process code (like the scheduler), since we no longer need to
worry about whether or not we are getting a "stack killing" kernel message.
Before this, code needed to care where it was running when managing _Ms.

Note that neither 'current_ctx' nor 'owning_proc' rely on 'current'/'cur_proc'.
'current' is just what process context we're in, not what process (and which
trapframe) we will eventually run.

cur_ctx does not point to kernel trapframes, which is important when we
receive an interrupt in the kernel.  At one point, we were (hypothetically)
clobbering the reference to the user trapframe, and were unable to recover.
We can get away with this because the kernel always returns to its previous
context from a nested handler (via iret on x86).

In the future, we may need to save kernel contexts and may not always return
via iret.  At which point, if the code path is deep enough that we don't want
to carry the TF pointer, we may revisit this.  Until then, current_ctx is just
for userspace contexts, and is simply stored in per_cpu_info.
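
Roughly, the trap/IRQ entry path copies the user context into per_cpu_info and
points cur_ctx at that copy.  A sketch, assuming a HW trapframe and fields
along the lines described here:

/* Called early in trap/IRQ entry, with interrupts disabled, only when we came
 * in from userspace (kernel-mode traps leave cur_ctx alone). */
static void set_current_ctx_hw(struct per_cpu_info *pcpui,
                               struct hw_trapframe *hw_tf)
{
	pcpui->actual_ctx.type = ROS_HW_CTX;
	pcpui->actual_ctx.tf.hw_tf = *hw_tf;	/* copy off the kernel stack */
	pcpui->cur_ctx = &pcpui->actual_ctx;
}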

Brief note from the future (months after this paragraph was written): cur_ctx
has two aspects/jobs:
1) tell the kernel what we should do (trap, fault, sysc, etc), how we came
   into the kernel (the fact that it is a user tf), which is why we copy-out
   early on
2) be a vehicle for us to restart the process/vcore

We've been focusing on the latter case a lot, since that is what gets
removed when preempted, changed during a notify, created during a startcore,
etc.  Don't forget it was also an instruction of sorts.  The former case is
always true throughout the life of the syscall.  The latter only happens to be
true throughout the life of a *non-blocking* trap since preempts are routine
KMSGs.  But if we block in a syscall, the cur_ctx is no longer the TF we came
in on (and possibly the one we are asked to operate on), and that old cur_ctx
has probably restarted.

(Note that cur_ctx is a pointer, and syscalls/traps actually operate on the TF
they came in on regardless of what happens to cur_ctx or pcpui->actual_tf.)

6. Locking!
===========================
6.1: proc_lock
---------------------------
Currently, all locking is done on the proc_lock.  Its main goal is to protect
the vcore mapping (vcore->pcore and vice versa).  As of Apr 2010, it's also used
to protect changes to the address space and the refcnt.  Eventually the refcnt
will be handled with atomics, and the address space will have its own MM lock.

We grab the proc_lock all over the place, but we try to avoid it wherever
possible - especially in kernel messages or other places that will be executed
in parallel.  One place we do grab it but would like to not is in proc_yield().
We don't always need to grab the proc lock.  Here are some examples:

6.1.1: Lockless Notifications:
-------------
We don't lock when sending a notification.  We want the proc_lock to not be an
irqsave lock (discussed below).  Since we might want to send a notification from
interrupt context, we can't grab the proc_lock if it's a regular lock.

This is okay, since the proc_lock is only protecting the vcoremapping.  We could
accidentally send the notification to the wrong pcore.  The __notif handler
checks to make sure it is the right process, and all _M processes should be able
to handle spurious notifications.  This assumes they are still _M.

If we send it to the wrong pcore, there is a danger of losing the notif, since
it didn't go to the correct vcore.  That would happen anyway (the vcore is
unmapped, or in the process of mapping).  The notif_pending flag will be caught
when the vcore is started up next time (and that flag was set before reading the
vcoremap).

6.1.2: Local get_vcoreid():
-------------
It's not necessary to lock while checking the vcoremap if you are checking for
the core you are running on (e.g. pcoreid == core_id()).  This is because all
unmappings of a vcore are done on the receive side of a routine kmsg, and that
code cannot run concurrently with the code you are running.
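
In other words, something like the following is safe without the proc_lock,
but only for the calling core.  This is a sketch; the pcoremap layout is
approximate:

static uint32_t get_my_vcoreid(struct proc *p)
{
	uint32_t pcoreid = core_id();

	/* Unmappings of this pcore's vcore run as routine kmsgs on this very
	 * core, so they cannot race with this code. */
	assert(p->procinfo->pcoremap[pcoreid].valid);
	return p->procinfo->pcoremap[pcoreid].vcoreid;
}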

6.2: irqsave
---------------------------
The proc_lock used to be an irqsave lock (meaning it disables interrupts and can
be grabbed from interrupt context).  We made it a regular lock for a couple
reasons.  The immediate one was it was causing deadlocks due to some other
ghetto things (blocking on the frontend server, for instance).  More generally,
we don't want to disable interrupts for long periods of time, so it was
something worth doing anyway.

This means that we cannot grab the proc_lock from interrupt context.  This
includes having schedule called from an interrupt handler (like the
timer_interrupt() handler), since it will call proc_run.  Right now, we actually
do this, which we shouldn't, and that will eventually get fixed.  The right
answer is that the actual work of running the scheduler should be a routine
kmsg, similar to how Linux sets a bit in the kernel that it checks on the way
out to see if it should run the scheduler or not.
7. TLB Coherency
===========================
When changing or removing memory mappings, we need to do some form of a TLB
shootdown.  Normally, this will require sending an IPI (immediate kmsg) to
every vcore of a process to unmap the affected page.  Before allocating that
page back out, we need to make sure that every TLB has been flushed.

One reason to use a kmsg over a simple handler is that we often want to pass a
virtual address to flush for those architectures (like x86) that can
invalidate a specific page.  Ideally, we'd use a broadcast kmsg (doesn't exist
yet), though we already have simple broadcast IPIs.

7.1 Initial Stuff
---------------------------
One big issue is whether or not to wait for a response from the other vcores
that they have unmapped.  There are two concerns: 1) Page reuse and 2) User
semantics.  We cannot give out the physical page while it may still be in a
TLB (even to the same process.  Ask us about the pthread_test bug).

The second case is a little more detailed.  The application may not like it if
it thinks a page is unmapped or protected, and it does not generate a fault.
I am less concerned about this, especially since we know that even if we don't
wait to hear from every vcore, we know that the message was delivered and the
IPI sent.  Any cores that are in userspace will have trapped and eventually
handle the shootdown before having a chance to execute other user code.  The
delays in the shootdown response are due to being in the kernel with
interrupts disabled (it was an IMMEDIATE kmsg).
|  |  | 
|  | 7.2 RCU | 
|  | --------------------------- | 
|  | One approach is similar to RCU.  Unmap the page, but don't put it on the free | 
|  | list.  Instead, don't reallocate it until we are sure every core (possibly | 
|  | just affected cores) had a chance to run its kmsg handlers.  This time is | 
|  | similar to the RCU grace periods.  Once the period is over, we can then truly | 
|  | free the page. | 
|  |  | 
|  | This would require some sort of RCU-like mechanism and probably a per-core | 
|  | variable that has the timestamp of the last quiescent period.  Code caring | 
|  | about when this page (or pages) can be freed would have to check on all of the | 
|  | cores (probably in a bitmask for what needs to be freed).  It would make sense | 
|  | to amortize this over several RCU-like operations. | 
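A rough sketch of what that could look like: a global epoch, a per-core record
of the last epoch at which the core drained its kmsg handlers, and a check
before reusing a page.  Everything here (names, fields, the epoch scheme) is a
hypothetical design, not existing code:

	/* Hypothetical RCU-style deferral.  A page unmapped at epoch E may be
	 * reused once every affected core has quiesced at an epoch >= E. */
	static unsigned long tlb_epoch;		/* bumped per shootdown batch */
	static unsigned long core_quiesced[MAX_NUM_CORES];

	void tlb_note_quiescent(void)	/* call after draining kmsg handlers */
	{
		core_quiesced[core_id()] = tlb_epoch;
	}

	bool page_tlb_safe(unsigned long unmap_epoch, uint32_t *cores,
	                   size_t nr_cores)
	{
		for (size_t i = 0; i < nr_cores; i++)
			if (core_quiesced[cores[i]] < unmap_epoch)
				return FALSE;
		return TRUE;	/* every affected core quiesced since unmap */
	}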
|  |  | 
|  | 7.3 Checklist | 
|  | --------------------------- | 
|  | It might not suck that much to wait for a response if you already sent an IPI, | 
|  | though it incurs some more cache misses.  If you wanted to ensure all vcores | 
|  | ran the shootdown handler, you'd have them all toggle their bit in a checklist | 
|  | (unused for a while, check smp.c).  The only one who waits would be the | 
|  | caller, but there still are a bunch of cache misses in the handlers.  Maybe | 
|  | this isn't that big of a deal, and the RCU thing is an unnecessary | 
|  | optimization. | 
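In sketch form, the checklist idea is just a bitmask of cores the caller spins
on, with each handler clearing its own bit once it has flushed.  This is a
generic illustration (using compiler atomics), not the actual checklist
interface in smp.c:

	static volatile unsigned long shootdown_pending;  /* cores to hear from */

	static void __shootdown_and_ack(uint32_t srcid, long a0, long a1,
	                                long a2)
	{
		tlbflush();	/* or invlpg the affected page/range */
		/* check in: this is where the extra cache misses come from */
		__sync_and_and_fetch(&shootdown_pending,
		                     ~(1UL << core_id()));
	}

	static void wait_for_shootdowns(void)
	{
		while (shootdown_pending)
			cpu_relax();
	}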
|  |  | 
|  | 7.4 Just Wait til a Context Switch | 
|  | --------------------------- | 
|  | Another option is to not bother freeing the page until the entire process is | 
|  | descheduled.  This could be a very long time, and also will mess with | 
|  | userspace's semantics.  They would be running user code that could still | 
|  | access the old page, so in essence this is a lazy munmap/mprotect.  The | 
process basically has the page in purgatory: it can't be reallocated, and it
|  | might be accessible, but can't be guaranteed to work. | 
|  |  | 
|  | The main benefit of this is that you don't need to send the TLB shootdown IPI | 
|  | at all - so you don't interfere with the app.  Though in return, they have | 
|  | possibly weird semantics.  One aspect of these weird semantics is that the | 
|  | same virtual address could map to two different pages - that seems like a | 
|  | disaster waiting to happen.  We could also block that range of the virtual | 
|  | address space from being reallocated, but that gets even more tricky. | 
|  |  | 
|  | One issue with just waiting and RCU is memory pressure.  If we actually need | 
|  | the page, we will need to enforce an unmapping, which sucks a little. | 
|  |  | 
|  | 7.5 Bulk vs Single | 
|  | --------------------------- | 
|  | If there are a lot of pages being shot down, it'd be best to amortize the cost | 
|  | of the kernel messages, as well as the invlpg calls (single page shootdowns). | 
|  | One option would be for the kmsg to take a range, and not just a single | 
|  | address.  This would help with bulk munmap/mprotects.  Based on the number of | 
these, perhaps a raw tlbflush (the entire TLB) would be worthwhile, instead
|  | of n single shots.  Odds are, that number is arch and possibly workload | 
|  | specific. | 
|  |  | 
|  | For now, the plan will be to send a range and have them individually shot | 
|  | down. | 
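A hedged sketch of that range-based handler, including the 'just flush
everything past some threshold' heuristic.  The handler name is invented, and
the 32-page cutoff is an arbitrary placeholder, not a measured number:

	#define SHOOTDOWN_FULL_FLUSH_THRESH 32	/* placeholder, not measured */

	static void __tlb_shootdown_range(uint32_t srcid, long a0, long a1,
	                                  long a2)
	{
		uintptr_t start = (uintptr_t)a0;
		size_t nr_pages = (size_t)a1;

		if (nr_pages > SHOOTDOWN_FULL_FLUSH_THRESH) {
			tlbflush();	/* cheaper than n invlpgs */
			return;
		}
		for (size_t i = 0; i < nr_pages; i++)
			asm volatile("invlpg (%0)" : :
			             "r"(start + i * PGSIZE) : "memory");
	}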
|  |  | 
|  | 7.6 Don't do it | 
|  | --------------------------- | 
|  | Either way, munmap/mprotect sucks in an MCP.  I recommend not doing it, and | 
|  | doing the appropriate mmap/munmap/mprotects in _S mode.  Unfortunately, even | 
|  | our crap pthread library munmaps on demand as threads are created and | 
destroyed.  The vcore code probably does too, in the bowels of glibc's TLS
code, though at least that isn't on every user context switch.
|  |  | 
|  | 7.7 Local memory | 
|  | --------------------------- | 
|  | Private local memory would help with this too.  If each vcore has its own | 
|  | range, we won't need to send TLB shootdowns for those areas, and we won't have | 
|  | to worry about weird application semantics.  The downside is we would need to | 
|  | do these mmaps in certain ranges in advance, and might not easily be able to | 
|  | do them remotely.  More on this when we actually design and build it. | 
|  |  | 
|  | 7.8 Future Hardware Support | 
|  | --------------------------- | 
|  | It would be cool and interesting if we had the ability to remotely shootdown | 
|  | TLBs.  For instance, all cores with cr3 == X, shootdown range Y..Z.  It's | 
|  | basically what we'll do with the kernel message and the vcoremap, but with | 
|  | magic hardware. | 
|  |  | 
|  | 7.9 Current Status | 
|  | --------------------------- | 
|  | For now, we just send a kernel message to all vcores to do a full TLB flush, | 
|  | and not to worry about checklists, waiting, or anything.  This is due to being | 
short on time and not wanting to sort out the issue with ranges.  The eventual
plan is to send the kmsg with the range to the appropriate cores, and then
maybe put the page on the end of the freelist (instead of the head).  More to
come.
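In rough form, the current scheme is just a walk over the online vcores,
sending each corresponding pcore an immediate kmsg that does a full flush.
The list and field names below are approximate and the helper is a sketch,
not the actual shootdown code:

	static void __tlb_full_flush(uint32_t srcid, long a0, long a1, long a2)
	{
		/* only flush if we're still in this proc's address space */
		if (current == (struct proc*)a0)
			tlbflush();
	}

	/* Caller holds the proc_lock, so the online list is stable. */
	static void proc_tlb_shootdown_all(struct proc *p)
	{
		struct vcore *vc_i;

		TAILQ_FOREACH(vc_i, &p->online_vcs, list)
			send_kernel_message(vc_i->pcoreid, __tlb_full_flush,
			                    (long)p, 0, 0, KMSG_IMMEDIATE);
	}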
|  |  | 
|  | 8. Process Management | 
|  | =========================== | 
|  | 8.1 Vcore lists | 
|  | --------------------------- | 
|  | We have three lists to track a process's vcores.  The vcores themselves sit in | 
|  | the vcoremap in procinfo; they aren't dynamically allocated (memory) or | 
anything like that.  The lists greatly ease vcore discovery and management.
|  |  | 
|  | A vcore is on exactly one of three lists: online (mapped and running vcores, | 
|  | sometimes called 'active'), bulk_preempt (was online when the process was bulk | 
|  | preempted (like a timeslice)), and inactive (yielded, hasn't come on yet, | 
|  | etc).  When writes are complete (unlocked), either the online list or the | 
|  | bulk_preempt list should be empty. | 
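For reference, here is a sketch of the three lists and of a vcore moving
between them under the proc_lock.  The TAILQ names and fields are illustrative
of the idea, not necessarily the exact struct layout:

	/* Each vcore is on exactly one list; the proc_lock protects all
	 * modifications.  Names are illustrative. */
	struct vcore {
		TAILQ_ENTRY(vcore)	list;	/* linkage for whichever list */
		/* mapping info, preempt counters, etc */
	};
	TAILQ_HEAD(vcore_tailq, vcore);

	struct proc {
		/* ... */
		struct vcore_tailq	online_vcs;	/* mapped and running */
		struct vcore_tailq	bulk_preempted_vcs;
		struct vcore_tailq	inactive_vcs;	/* yielded, never ran */
	};

	/* e.g. a yield, with the proc_lock held: online -> inactive */
	static void __vcore_yielded(struct proc *p, struct vcore *vc)
	{
		TAILQ_REMOVE(&p->online_vcs, vc, list);
		TAILQ_INSERT_HEAD(&p->inactive_vcs, vc, list);
	}

Putting the yielded vcore at the head of the inactive list is what gives the
'most recently yielded vcore is handed out next' behavior mentioned below.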
|  |  | 
|  | List modifications are protected by the proc_lock.  You can concurrently read, | 
|  | but note you may get some weird behavior, such as a vcore on multiple lists, a | 
|  | vcore on no lists, online and bulk_preempt both having items, etc.  Currently, | 
|  | event code will read these lists when hunting for a suitable core, and will | 
have to be careful about races.  I want to avoid having event FALLBACK code
grab the proc_lock.
|  |  | 
Another slight thing to be careful of is that the vcore lists don't always
agree with the vcore mapping.  However, they will always agree with what the
state of the process will be when all kmsgs are processed (fate).
|  | Specifically, when we take vcores, the unmapping happens with the lock not | 
|  | held on the vcore itself (as discussed elsewhere).  The vcore lists represent | 
|  | the result of those pending unmaps. | 
|  |  | 
|  | Before we used the lists, we scanned the vcoremap in a painful, clunky manner. | 
|  | In the old style, when you asked for a vcore, the first one you got was the | 
|  | first hole in the vcoremap.  Ex: Vcore0 would always be granted if it was | 
|  | offline.  That's no longer true; the most recent vcore yielded will be given | 
|  | out next.  This will help with cache locality, and also cuts down on the | 
scenarios in which the kernel gives out a vcore that userspace wasn't
|  | expecting.  This can still happen if they ask for more vcores than they set up | 
|  | for, or if a vcore doesn't *want* to come online (there's a couple scenarios | 
|  | with preemption recovery where that may come up). | 
|  |  | 
|  | So the plan with the bulk preempt list is that vcores on it were preempted, | 
|  | and the kernel will attempt to restart all of them (and move them to the online | 
|  | list).  Any leftovers will be moved to the inactive list, and have preemption | 
|  | recovery messages sent out.  Any shortages (they want more vcores than were | 
bulk_preempted) will be taken from the inactive list.  This all means that
|  | whether or not a vcore needs to be preempt-recovered or if there is a message | 
|  | out about its preemption doesn't really affect which list it is on.  You could | 
|  | have a vcore on the inactive list that was bulk preempted (and not turned back | 
|  | on), and then that vcore gets granted in the next round of vcore_requests(). | 
|  | The preemption recovery handlers will need to deal with concurrent handlers | 
|  | and the vcore itself starting back up. | 
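To make the order of those list moves concrete, here is a hypothetical sketch
of that pass (names invented, details like the recovery event elided); it is
only the list motion described above, not the real proc management code:

	/* Hand out 'amt' vcores after a bulk preempt.  Called with the
	 * proc_lock held.  Prefer bulk_preempted vcores, park any leftovers on
	 * the inactive list (they get recovery messages), and fill any
	 * shortage from the inactive list. */
	static uint32_t pick_vcores_after_bulk_preempt(struct proc *p,
	                                               struct vcore **out,
	                                               uint32_t amt)
	{
		struct vcore *vc;
		uint32_t nr = 0;

		while (nr < amt &&
		       (vc = TAILQ_FIRST(&p->bulk_preempted_vcs))) {
			TAILQ_REMOVE(&p->bulk_preempted_vcs, vc, list);
			TAILQ_INSERT_TAIL(&p->online_vcs, vc, list);
			out[nr++] = vc;		/* will get a __startcore */
		}
		/* leftovers: not restarted; userspace must recover them */
		while ((vc = TAILQ_FIRST(&p->bulk_preempted_vcs))) {
			TAILQ_REMOVE(&p->bulk_preempted_vcs, vc, list);
			TAILQ_INSERT_HEAD(&p->inactive_vcs, vc, list);
			/* send a preemption recovery event for vc here */
		}
		/* shortage: pull from the inactive list */
		while (nr < amt && (vc = TAILQ_FIRST(&p->inactive_vcs))) {
			TAILQ_REMOVE(&p->inactive_vcs, vc, list);
			TAILQ_INSERT_TAIL(&p->online_vcs, vc, list);
			out[nr++] = vc;
		}
		return nr;
	}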
|  |  | 
|  | 9. On the Ordering of Messages and Bugs with Old State | 
|  | =========================== | 
|  | This is a sordid tale involving message ordering, message delivery times, and | 
|  | finding out (sometimes too late) that the state you expected is gone and | 
|  | having to deal with that error. | 
|  |  | 
|  | A few design issues: | 
|  | - being able to send messages and have them execute in the order they are | 
|  | sent | 
|  | - having message handlers resolve issues with global state.  Some need to know | 
|  | the correct 'world view', and others need to know what was the state at the | 
|  | time they were sent. | 
|  | - realizing syscalls, traps, faults, and any non-IRQ entry into the kernel is | 
|  | really a message. | 
|  |  | 
|  | Process management messages have alternated from ROUTINE to IMMEDIATE and now | 
|  | back to ROUTINE.  These messages include such family favorites as | 
|  | '__startcore', '__preempt', etc.  Meanwhile, syscalls were coming in that | 
|  | needed to know about the core and the process's state (specifically, yield, | 
change_to, and get_vcoreid).  Finally, we wanted to avoid locking, especially
in KMSG handlers (imagine all cores grabbing the lock to check the vcoremap or
something).
|  |  | 
Incidentally, events were being delivered concurrently to vcores, though that
|  | actually didn't matter much (check out async_events.txt for more on that). | 
|  |  | 
|  | 9.1: Design Guidelines | 
|  | --------------------------- | 
|  | Initially, we wanted to keep broadcast messaging available as an option.  As | 
|  | noted elsewhere, we can't really do this well for startcore, since most | 
|  | hardware broadcast options need some initial per-core setup, and any sort of | 
|  | broadcast tree we make should be able to handle a small message.  Anyway, this | 
desire in the early code to keep all messages identical led to a few
|  | problems. | 
|  |  | 
|  | Another objective of the kernel messaging was to avoid having the message | 
|  | handlers grab any locks, especially the same lock (the proc lock is used to | 
|  | protect the vcore map, for instance). | 
|  |  | 
|  | Later on, a few needs popped up that motivated the changes discussed below: | 
|  | - Being able to find out which proc/vcore was on a pcore | 
|  | - Not having syscalls/traps require crazy logic if the carpet was pulled out | 
|  | from under them. | 
|  | - Having proc management calls return.  This one was sorted out by making all | 
|  | kmsg handlers return.  It would be a nightmare making a ksched without this. | 
|  |  | 
|  | 9.2: Looking at Old State: a New Bug for an Old Problem | 
|  | --------------------------- | 
We've always had issues with syscalls coming in after the fate of a core had
already been determined.  This is referred to in a few places as "predetermined
fate" vs "local state".  A remote lock holder (ksched) already determined a core
|  | should be unmapped and sent a message.  Only later does some call like | 
|  | proc_yield() realize its core is already *unmapped*. (I use that term poorly | 
|  | here).  This sort of code had to realize it was working on an old version of | 
|  | state and just abort.  This was usually safe, though looking at the vcoremap | 
|  | was a bad idea.  Initially, we used preempt_served as the signal, which was | 
|  | okay.  Around 12b06586 yield started to use the vcoremap, which turned out to | 
|  | be wrong. | 
|  |  | 
|  | A similar issue happens for the vcore messages (startcore, preempt, etc).  The | 
|  | way startcore used to work was that it would only know what pcore it was on, | 
|  | and then look into the vcoremap to figure out what vcoreid it should be | 
|  | running.  This was to keep broadcast messaging available as an option.  The | 
|  | problem with it is that the vcoremap may have changed between when the | 
|  | messages were sent and when they were executed.  Imagine a startcore followed | 
by a preempt, after which the vcore was unmapped.  Well, to get around that, we
|  | had the unmapping happen in the preempt or death handlers.  Yikes!  This was | 
|  | the case back in the early days of ROS.  This meant the vcoremap wasn't | 
|  | actually representative of the decisions the ksched made - we also needed to | 
|  | look at the state we'd have after all outstanding messages executed.  And this | 
|  | would differ from the vcore lists (which were correct for a lock holder). | 
|  |  | 
This was manageable for a little while, until I tried to conclusively know who
owned a particular pcore.  This came up while making a provisioning scheduler.
Given a pcore, tell me which process/vcore (if any) was on it.  It was rather
|  | tough.  Getting the proc wasn't too hard, but knowing which vcore was a little | 
|  | tougher.  (Note the ksched doesn't care about which vcore is running, and the | 
|  | process can change vcores on a pcore at will).  But once you start looking at | 
|  | the process, you can't tell which vcore a certain pcore has.  The vcoremap may | 
|  | be wrong, since a preempt is already on the way.  You would have had to scan | 
|  | the vcore lists to see if the proc code thought that vcore was online or not | 
|  | (which would mean there had been no preempts).  This is the pain I was talking | 
|  | about back around commit 5343a74e0. | 
|  |  | 
|  | So I changed things so that the vcoremap was always correct for lock holders, | 
|  | and used pcpui to track owning_vcoreid (for preempt/notify), and used an extra | 
|  | KMSG variable to tell startcore which vcoreid it should use.  In doing so, we | 
|  | (re)created the issue that the delayed unmapping dealt with: the vcoremap | 
|  | would represent *now*, and not the vcoremap of when the messages were first | 
|  | sent.  However, this had little to do with the KMSGs, which I was originally | 
|  | worried about.  No one was looking at the vcoremap without the lock, so the | 
|  | KMSGs were okay, but remember: syscalls are like messages too.  They needed to | 
|  | figure out what vcore they were on, i.e. what vcore userspace was making | 
|  | requests on (viewing a trap/fault as a type of request). | 
|  |  | 
|  | Now the problem was that we were using the vcoremap to figure out which vcore | 
|  | we were supposed to be.  When a syscall finally ran, the vcoremap could be | 
|  | completely wrong, and with immediate KMSGs (discussed below), the pcpui was | 
|  | already changed!  We dealt with the problem for KMSGs, but not syscalls, and | 
|  | basically reintroduced the bug of looking at current state and thinking it | 
|  | represented the state from when the 'message' was sent (when we trapped into | 
|  | the kernel, for a syscall/exception). | 
|  |  | 
|  | 9.3: Message Delivery, Circular Waiting, and Having the Carpet Pulled Out | 
|  | --------------------------- | 
|  | In-order message delivery was what drove me to build the kernel messaging | 
|  | system in the first place.  It provides in-order messages to a particular | 
|  | pcore.  This was enough for a few scenarios, such as preempts racing ahead of | 
startcores, or deaths racing ahead of preempts, etc.  However, I also wanted
|  | an ordering of messages related to a particular vcore, and this wasn't | 
|  | apparent early on. | 
|  |  | 
The issue first popped up with a startcore coming quickly on the heels of a
|  | preempt for the same VC, but on different PCs.  The startcore cannot proceed | 
|  | until the preempt saved the TF into the VCPD.  The old way of dealing with | 
|  | this was to spin in '__map_vcore()'.  This was problematic, since it meant we | 
|  | were spinning while holding a lock, and resulted in some minor bugs and issues | 
|  | with lock ordering and IRQ disabling (couldn't disable IRQs and then try to | 
|  | grab the lock, since the lock holder could have sent you a message and is | 
|  | waiting for you to handle the IRQ/IMMED KMSG).  However, it was doable.  But | 
|  | what wasn't doable was to have the KMSGs be ROUTINE.  Any syscalls that tried | 
|  | to grab the proc lock (lots of them) would deadlock, since the lock holder was | 
|  | waiting on us to handle the preempt (same circular waiting issue as above). | 
|  |  | 
|  | This was fine, albeit subpar, until a new issue showed up.  Sending IMMED | 
|  | KMSGs worked fine if we were coming from userspace already, but if we were in | 
|  | the kernel, those messages would run immediately (hence the name), just like | 
|  | an IRQ handler, and could confuse syscalls that touched cur_ctx/pcpui.  If a | 
|  | preempt came in during a syscall, the process/vcore could be changed before | 
|  | the syscall took place.  Some syscalls could handle this, albeit poorly. | 
|  | sys_proc_yield() and sys_change_vcore() delicately tried to detect if they | 
|  | were still mapped or not and use that to determine if a preemption happened. | 
|  |  | 
|  | As mentioned above, looking at the vcoremap only tells you what is currently | 
|  | happening, and not what happened in the past.  Specifically, it doesn't tell | 
|  | you the state of the mapping when a particular core trapped into the kernel | 
|  | for a syscall (referred to as when the 'message' was sent up above).  Imagine | 
|  | sys_get_vcoreid(): you trap in, then immediately get preempted, then startcore | 
|  | for the same process but a different vcoreid.  The syscall would return with | 
|  | the vcoreid of the new vcore, since it cannot tell there was a change.  The | 
|  | async syscall would complete and we'd have a wrong answer.  While this never | 
|  | happened to me, I had a similar issue while debugging some other bugs (I'd get | 
|  | a vcoreid of 0xdeadbeef, for instance, which was the old poison value for an | 
|  | unmapped vcoreid).  There are a bunch of other scenarios that trigger similar | 
|  | disasters, and they are very hard to avoid. | 
|  |  | 
|  | One way out of this was a per-core history counter, that changed whenever we | 
|  | changed cur_ctx.  Then when we trapped in for a syscall, we could save the | 
|  | value, enable_irqs(), and go about our business.  Later on, we'd have to | 
|  | disable_irqs() and compare the counters.  If they were different, we'd have to | 
bail out somehow.  This could have worked for change_to and yield, and some
|  | others.  But any syscall that wanted to operate on cur_ctx in some way would | 
|  | fail (imagine a hypothetical sys_change_stack_pointer()).  The context that | 
|  | trapped has already returned on another core.  I guess we could just fail that | 
|  | syscall, though it seems a little silly to not be able to do that. | 
|  |  | 
The previous example was a bit contrived, but let's also remember that it isn't
|  | just syscalls: all exceptions have the same issue.  Faults might be fixable, | 
|  | since if you restart a faulting context, it will start on the faulting | 
|  | instruction.  However all traps (like syscall) restart on the next | 
|  | instruction.  Hope we don't want to do anything fancy with breakpoint!  Note | 
|  | that I had breakpointing contexts restart on other pcores and continue while I | 
|  | was in the breakpoint handler (noticed while I was debugging some bugs with | 
|  | lots of preempts).  Yikes.  And don't forget we eventually want to do some | 
|  | complicated things with the page fault handler, and may want to turn on | 
interrupts / kthread during a page fault (imagine hitting disk).  Yikes.
|  |  | 
|  | So I looked into going back to ROUTINE kernel messages.  With ROUTINE | 
|  | messages, I didn't have to worry about having the carpet pulled out from under | 
|  | syscalls and exceptions (traps, faults, etc).  The 'carpet' is stuff like | 
|  | cur_ctx, owning_proc, owning_vcoreid, etc.  We still cannot trust the vcoremap, | 
|  | unless we *know* there were no preempts or other KMSGs waiting for us. | 
|  | (Incidentally, in the recent fix a93aa7559, we merely use the vcoremap as a | 
|  | sanity check). | 
|  |  | 
|  | However, we can't just switch back to ROUTINEs.  Remember: with ROUTINEs, | 
|  | we will deadlock in '__map_vcore()', when it waits for the completion of | 
|  | preempt.  Ideally, we would have had startcore spin on the signal.  Since we | 
|  | already gave up on using x86-style broadcast IPIs for startcore (in | 
|  | 5343a74e0), we might as well pass along a history counter, so it knows to wait | 
|  | on preempt. | 
|  |  | 
|  | 9.4: The Solution | 
|  | --------------------------- | 
|  | To fix up all of this, we now detect preemptions in syscalls/traps and order | 
|  | our kernel messages with two simple per-vcore counters.  Whenever we send a | 
|  | preempt, we up one counter.  Whenever that preempt finishes, it ups another | 
|  | counter.  When we send out startcores, we send a copy of the first counter. | 
|  | This is a way of telling startcore where it belongs in the list of messages. | 
|  | More specifically, it tells it which preempt happens-before it. | 
|  |  | 
|  | Basically, I wanted a partial ordering on my messages, so that messages sent | 
|  | to a particular vcore are handled in the order they were sent, even if those | 
|  | messages run on different physical cores. | 
|  |  | 
|  | It is not sufficient to use a seq counter (one integer, odd values for | 
|  | 'preempt in progress' and even values for 'preempt done').  It is possible to | 
|  | have multiple preempts in flight for the same vcore, albeit with startcores in | 
|  | between.  Still, there's no way to encode that scenario in just one counter. | 
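Here is a hedged sketch of both ends of the protocol, with the counters living
in the per-vcore struct.  The counter names follow the text above; the handler
arguments, the exact args passed to __preempt/__startcore, and the location of
the counters are assumptions:

	/* In the per-vcore struct (per the text above):
	 *	unsigned long nr_preempts_sent;
	 *	unsigned long nr_preempts_done;
	 * Sending side, done by the lock holder (ksched / proc mgmt): */
	static void send_preempt(struct proc *p, uint32_t vcoreid,
	                         uint32_t pcoreid)
	{
		struct vcore *vc = &p->procinfo->vcoremap[vcoreid];

		vc->nr_preempts_sent++;
		send_kernel_message(pcoreid, __preempt, (long)p, 0, 0,
		                    KMSG_ROUTINE);
		/* __preempt ups vc->nr_preempts_done once the context is
		 * safely saved in the VCPD. */
	}

	static void send_startcore(struct proc *p, uint32_t vcoreid,
	                           uint32_t pcoreid)
	{
		struct vcore *vc = &p->procinfo->vcoremap[vcoreid];

		/* stamp the startcore with the preempt it happens-after */
		send_kernel_message(pcoreid, __startcore, (long)p,
		                    (long)vcoreid,
		                    (long)vc->nr_preempts_sent, KMSG_ROUTINE);
	}

	/* Receiving side: __startcore spins until the preempt it follows has
	 * completed, possibly on another pcore. */
	static void __startcore(uint32_t srcid, long a0, long a1, long a2)
	{
		struct proc *p = (struct proc*)a0;
		struct vcore *vc = &p->procinfo->vcoremap[(uint32_t)a1];

		while (vc->nr_preempts_done != (unsigned long)a2)
			cpu_relax();
		/* ... now safe to load cr3, set owning_proc/owning_vcoreid,
		 * and pop the vcore context ... */
	}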
|  |  | 
|  | Here's a normal example of traffic to some vcore.  I note both the sending and | 
|  | the execution of the kmsgs: | 
|  | nr_pre_sent    nr_pre_done    pcore     message sent/status | 
|  | ------------------------------------------------------------- | 
|  | 0              0              X         startcore (nr_pre_sent == 0) | 
|  | 0              0              X         startcore (executes) | 
|  | 1              0              X         preempt   (kmsg sent) | 
|  | 1              1              Y         preempt   (executes) | 
|  | 1              1              Y         startcore (nr_pre_sent == 1) | 
|  | 1              1              Y         startcore (executes) | 
|  |  | 
|  | Note the messages are always sent by the lockholder in the order of the | 
|  | example above. | 
|  |  | 
|  | Here's when the startcore gets ahead of the prior preempt: | 
|  | nr_pre_sent    nr_pre_done    pcore     message sent/status | 
|  | ------------------------------------------------------------- | 
|  | 0              0              X         startcore (nr_pre_sent == 0) | 
|  | 0              0              X         startcore (executes) | 
|  | 1              0              X         preempt   (kmsg sent) | 
|  | 1              0              Y         startcore (nr_pre_sent == 1) | 
|  | 1              1              X         preempt   (executes) | 
|  | 1              1              Y         startcore (executes) | 
|  |  | 
|  | Note that this can only happen across cores, since KMSGs to a particular core | 
|  | are handled in order (for a given class of message).  The startcore blocks on | 
|  | the prior preempt. | 
|  |  | 
|  | Finally, here's an example of what a seq ctr can't handle: | 
|  | nr_pre_sent    nr_pre_done    pcore     message sent/status | 
|  | ------------------------------------------------------------- | 
|  | 0              0              X         startcore (nr_pre_sent == 0) | 
|  | 1              0              X         preempt   (kmsg sent) | 
|  | 1              0              Y         startcore (nr_pre_sent == 1) | 
|  | 2              0              Y         preempt   (kmsg sent) | 
|  | 2              0              Z         startcore (nr_pre_sent == 2) | 
|  | 2              1              X         preempt   (executes (upped to 1)) | 
|  | 2              1              Y         startcore (executes (needed 1)) | 
|  | 2              2              Y         preempt   (executes (upped to 2)) | 
|  | 2              Z              Z         startcore (executes (needed 2)) | 
|  |  | 
|  | As a nice bonus, it is easy for syscalls that care about the vcoreid (yield, | 
|  | change_to, get_vcoreid) to check if they have a preempt_served.  Just grab the | 
|  | lock (to prevent further messages being sent), then check the counters.  If | 
|  | they are equal, there is no preempt on its way.  This actually was the | 
|  | original way we checked for preempts in proc_yield back in the day.  It was | 
|  | just called preempt_served.  Now, it is split into two counters, instead of | 
|  | just being a bool. | 
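In sketch form, the check those syscalls can make (names follow the text, not
necessarily the exact source):

	/* TRUE if a preempt has been sent for this vcore but has not yet
	 * finished - i.e. our fate is already decided and the caller should
	 * bail out.  Call with the proc_lock held, so no new messages can be
	 * sent while we look. */
	static bool preempt_served(struct proc *p, uint32_t vcoreid)
	{
		struct vcore *vc = &p->procinfo->vcoremap[vcoreid];

		return vc->nr_preempts_sent != vc->nr_preempts_done;
	}

proc_yield()-style code would grab the proc_lock, run a check like this, and
abort (unlock and return) if it comes back TRUE, since a routine __preempt is
already queued for this core.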
|  |  | 
|  | Regardless of whether or not we were preempted, we still can look at | 
|  | pcpui->owning_proc and owning_vcoreid to figure out what the vcoreid of the | 
|  | trap/syscall is, and we know that the cur_ctx is still the correct cur_ctx (no | 
|  | carpet pulled out), since while there could be a preempt ROUTINE message | 
|  | waiting for us, we simply haven't run it yet.  So calls like yield should | 
|  | still fail (since your core has been unmapped and you need to bail out and run | 
|  | the preempt handler), but calls like sys_change_stack_pointer can proceed. | 
|  | More importantly than that old joke syscall, the page fault handler can try to | 
|  | do some cool things without worrying about really crazy stuff. | 
|  |  | 
|  | 9.5: Why We (probably) Don't Deadlock | 
|  | --------------------------- | 
It's worth thinking about why this setup of preempts and startcores can't
deadlock.  Any time we spin in the kernel, we ought to think this through.
Perhaps there is some issue with other KMSGs for other processes, or other
vcores, or something along those lines, that could cause a deadlock.
|  |  | 
Hypothetical case: the startcore for vc1 runs on PC1 and waits on vc1's
preempt, which is queued on PC2 behind the startcore for vc2; that startcore
in turn waits on vc2's preempt, which is queued on PC1 behind the startcore
for vc1.  In these examples, time goes upwards, and startcores are waiting on
particular preempts, subject to the nr_preempts_sent parameter sent along with
the startcores.
|  |  | 
|  | ^ | 
|  | |            _________                 _________ | 
|  | |           |         |               |         | | 
|  | |           | pr vc 2 |               | pr vc 1 | | 
|  | |           |_________|               |_________| | 
|  | | | 
|  | |            _________                 _________ | 
|  | |           |         |               |         | | 
|  | |           | sc vc 1 |               | sc vc 2 | | 
|  | |           |_________|               |_________| | 
|  | t | 
|  | --------------------------------------------------------------------------- | 
|  | ______                    ______ | 
|  | |      |                  |      | | 
|  | | PC 1 |                  | PC 2 | | 
|  | |______|                  |______| | 
|  |  | 
Here's the same picture, but with certain happens-before arrows.  We'll use
X --> Y to mean X happened before Y, e.g. X was sent before Y (a startcore is
sent after a preempt).
|  |  | 
|  | ^ | 
|  | |            _________                 _________ | 
|  | |           |         |               |         | | 
|  | |       .-> | pr vc 2 | --.    .----- | pr vc 1 | <-. | 
|  | |       |   |_________|    \  /   &   |_________|   | | 
|  | |     * |                   \/                      | * | 
|  | |       |    _________      /\         _________    | | 
|  | |       |   |         |    /  \   &   |         |   | | 
|  | |       '-- | sc vc 1 | <-'    '----> | sc vc 2 | --' | 
|  | |           |_________|               |_________| | 
|  | t | 
|  | --------------------------------------------------------------------------- | 
|  | ______                    ______ | 
|  | |      |                  |      | | 
|  | | PC 1 |                  | PC 2 | | 
|  | |______|                  |______| | 
|  |  | 
|  | The arrows marked with * are ordered like that due to the property of KMSGs, | 
|  | in that we have in order delivery.  Messages are executed in the order in | 
|  | which they were sent (serialized with a spinlock btw), so on any pcore, | 
|  | messages that are further ahead in the queue were sent before (and thus will | 
|  | be run before) other messages. | 
|  |  | 
|  | The arrows marked with a & are ordered like that due to how the proc | 
|  | management code works.  The kernel won't send out a startcore for a particular | 
vcore before it sent out a preempt.  (Note that technically, preempts follow
|  | startcores.  The startcores in this example are when we start up a vcore after | 
|  | it had been preempted in the past.). | 
|  |  | 
|  | Anyway, note that we have a cycle, where all events happened before each | 
|  | other, which isn't possible.  The trick to connecting "unrelated" events like | 
|  | this (unrelated meaning 'not about the same vcore') in a happens-before manner | 
|  | is the in-order properties of the KMSGs. | 
|  |  | 
|  | Based on this example, we can derive general rules.  Note that 'sc vc 2' could | 
|  | be any kmsg that waits on another message placed behind 'sc vc 1'.  This would | 
require us having sent a KMSG that waits on a KMSG that we send later.  Bad
idea!  (Aside from being generally dangerous, you could even have sent that
KMSG to yourself.)  If you want to spin, make sure you send the work that
should happen-before actually before the waiter.
|  |  | 
|  | In fact, we don't even need 'sc vc 2' to be a KMSG.  It could be miscellaneous | 
|  | kernel code, like a proc mgmt syscall.  Imagine if we did something like the | 
|  | old '__map_vcore' call from within the ksched.  That would be code that holds | 
|  | the lock, and then waits on the execution of a message handler.  That would | 
|  | deadlock (which is why we don't do it anymore). | 
|  |  | 
|  | Finally, in case this isn't clear, all of the startcores and preempts for | 
|  | a given vcore exist in a happens-before relation, both in sending and in | 
|  | execution.  The sending aspect is handled by proc mgmt code.  For execution, | 
|  | preempts always follow startcores due to the KMSG ordering property.  For | 
|  | execution of startcores, startcores always spin until the preempt they follow | 
|  | is complete, ensuring the execution of the main part of their handler happens | 
|  | after the prior preempt. | 
|  |  | 
Here are some good ideas for the ordering of locks/irqs/messages:
|  | - You can't hold a spinlock of any sort and then wait on a routine kernel | 
|  | message.  The core where that runs may be waiting on you, or some scenario | 
|  | like above. | 
|  | - Similarly, think about how this works with kthreads.  A kthread | 
|  | restart is a routine KMSG.  You shouldn't be waiting on code that | 
|  | could end up kthreading, mostly because those calls block! | 
|  | - You can hold a spinlock and wait on an IMMED kmsg, if the waiters of the | 
|  | spinlock have irqs enabled while spinning (this is what we used to do with | 
|  | the proc lock and IMMED kmsgs, and 54c6008 is an example of doing it wrong) | 
|  | - As a corollary, locks like this cannot be irqsave, since the other | 
|  | attempted locker will have irq disabled | 
|  | - For broadcast trees, you'd have to send IMMEDs for the intermediates, and | 
|  | then it'd be okay to wait on those intermediate, immediate messages (if we | 
|  | wanted confirmation of the posting of RKM) | 
|  | - The main thing any broadcast mechanism needs to do is make sure all | 
|  | messages get delivered in order to particular pcores (the central | 
|  | premise of KMSGs) (and not deadlock due to waiting on a KMSG | 
|  | improperly) | 
- Alternatively, we could use routines for the intermediates if we didn't want
to wait for RKMs to hit their destination.  In that case, we'd need to always
use the same proxy for the same destination pcore, e.g., core 16 always covers
16-31.
|  | - Otherwise, we couldn't guarantee the ordering of SC before PR before | 
|  | another SC (which the proc_lock and proc mgmt code does); we need the | 
|  | ordering of intermediate msgs on the message queues of a particular | 
|  | core. | 
|  | - All kmsgs would need to use this broadcasting style (couldn't mix | 
|  | regular direct messages with broadcast), so odds are this style would | 
|  | be of limited use. | 
|  | - since we're not waiting on execution of a message, we could use RKMs | 
|  | (while holding a spinlock) | 
- There might be some bad effects with kthreads delaying the reception of RKMs
for a while, but probably not catastrophically.
|  |  | 
|  | 9.6: Things That We Don't Handle Nicely | 
|  | --------------------------- | 
|  | If for some reason a syscall or fault handler blocks *unexpectedly*, we could | 
|  | have issues.  Imagine if change_to happens to block in some early syscall code | 
|  | (like instrumentation, or who knows what, that blocks in memory allocation). | 
|  | When the syscall kthread restarts, its old cur_ctx is gone.  It may or may not | 
|  | be running on a core owned by the original process.  If it was, we probably | 
|  | would accidentally yield that vcore (clearly a bug). | 
|  |  | 
|  | For now, any of these calls that care about cur_ctx/pcpui need to not block | 
|  | without some sort of protection.  None of them do, but in the future we might | 
|  | do something that causes them to block.  We could deal with it by having a | 
|  | pcpu or per-kthread/syscall flag that says if it ever blocked, and possibly | 
|  | abort.  We get into similar nasty areas as with preempts, but this time, we | 
|  | can't solve it by making preempt a routine KMSG - we block as part of that | 
|  | syscall/handler code.  Odds are, we'll just have to outlaw this, now and | 
|  | forever.  Just note that if a syscall/handler blocks, the TF it came in on is | 
|  | probably not cur_ctx any longer, and that old cur_ctx has probably restarted. | 
|  |  | 
|  | 10. TBD | 
|  | =========================== |