| process-internals.txt |
| Barret Rhoden |
| |
| This discusses core issues with process design and implementation. Most of this |
| info is available in the source in the comments (but may not be in the future). |
| For now, it's a dumping ground for topics that people ought to understand before |
| they muck with how processes work. |
| |
| Contents: |
| 1. Reference Counting |
| 2. When Do We Really Leave "Process Context"? |
| 3. Leaving the Kernel Stack |
| 4. Preemption and Notification Issues |
| 5. current_ctx and owning_proc |
| 6. Locking! |
| 7. TLB Coherency |
| 8. Process Management |
| 9. On the Ordering of Messages |
| 10. TBD |
| |
| 1. Reference Counting |
| =========================== |
| 1.1 Basics: |
| --------------------------- |
| Reference counts exist to keep a process alive. We use krefs for this, similar |
| to Linux's kref: |
| - Can only incref if the current value is greater than 0, meaning there is |
| already a reference to it. It is a bug to try to incref on something that has |
| no references, so always make sure you incref something that you know has a |
| reference. If you don't know, you need to get it from pid2proc() (which is a |
| careful way of doing this - pid2proc() kref_get_not_zero()s the reference |
| stored inside it). If you incref when there are 0 references, the kernel will |
| panic. Fix your bug / don't incref random pointers. |
| - Can always decref. |
| - When the decref returns 0, perform some operation. This does some final |
| cleanup on the object. |
| - Process code is trickier: we frequently make references from 'current' |
| (which isn't too bad), and we often do not return, so we need to be careful |
| about the references we pass in to no-return functions (a sketch of the |
| basics follows). |
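| |
| Here's a minimal sketch of the basic discipline, using the pid2proc() path |
| described above. Treat the exact signatures as illustrative and check the |
| source for the real ones: |
| |
|     struct proc *p = pid2proc(pid); /* increfs (kref_get_not_zero) for us */ |
|     if (!p) |
|         return;                 /* no such proc, or it was already dying */ |
|     /* our reference guarantees p stays alive in here */ |
|     do_stuff_with(p);           /* hypothetical work */ |
|     proc_decref(p);             /* done: drop the ref from pid2proc() */ |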
| |
| 1.2 Brief History of the Refcnt: |
| --------------------------- |
| Originally, the refcnt was created to keep page tables from being destroyed (in |
| proc_free()) while cores were still using them, which is what happens during |
| an ARSC (async remote syscall). It was then defined to be a count of places in |
| the kernel that had an interest in the process staying alive, practically just |
| to protect current/cr3. This 'interest' actually extends to any code holding a |
| pointer to the proc, such as one acquired via pid2proc(), which is its current |
| meaning. |
| |
| 1.3 Quick Aside: The current Macro: |
| --------------------------- |
| current is a pointer to the proc that is currently loaded/running on any given |
| core. It is stored in the per_cpu_info struct, and set/managed by low-level |
| process code. It is necessary for the kernel to quickly figure out who is |
| running on its core, especially when servicing interrupts and traps. current is |
| protected by a refcnt. |
| |
| current does not say which process owns / will-run on a core. The per-cpu |
| variable 'owning_proc' covers that. 'owning_proc' should be treated like |
| 'current' (aka, 'cur_proc') when it comes to reference counting. Like all |
| refcnts, you can use it, but you can't consume it without atomically either |
| upping the refcnt or passing the reference (clearing the variable storing the |
| reference). Don't pass it to a function that will consume it and not return |
| unless you up the refcnt first. |
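| |
| As a sketch, "passing the reference" from 'owning_proc' means clearing the |
| variable so that the stored ref becomes yours. The per-cpu field and helper |
| names here are illustrative: |
| |
|     /* interrupts must be disabled while touching these per-cpu variables */ |
|     struct per_cpu_info *pcpui = &per_cpu_info[core_id()]; |
|     struct proc *p = pcpui->owning_proc; |
|     pcpui->owning_proc = 0; /* the ref that backed owning_proc is now ours */ |
|     /* ... either store p somewhere that keeps the ref, or: */ |
|     proc_decref(p); |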
| |
| 1.4 Reference Counting Rules: |
| --------------------------- |
| +1 for existing. |
| - The fact that the process is supposed to exist is worth +1. When it is time |
| to die, we decref, and it will eventually be cleaned up. This existence is |
| explicitly kref_put()d in proc_destroy(). |
| - The hash table is a bit tricky. We need to kref_get_not_zero() when it is |
| locked, so we know we aren't racing with proc_free freeing the proc and |
| removing it from the list. After removing it from the hash, we don't need to |
| kref_put it, since it was an internal ref. The kref (i.e. external) isn't for |
| being on the hash list, it's for existing. This separation allows us to |
| remove the proc from the hash list in the "release" function. See kref.txt |
| for more details. |
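| |
| A simplified sketch of that lookup (the real pid2proc() may differ in its |
| details): |
| |
|     struct proc *pid2proc(pid_t pid) |
|     { |
|         struct proc *p; |
| |
|         spin_lock(&pid_hash_lock); |
|         p = hashtable_search(pid_hash, (void*)(long)pid); |
|         if (p && !kref_get_not_zero(&p->p_kref, 1)) |
|             p = 0;  /* lost the race: proc_free() is in progress */ |
|         spin_unlock(&pid_hash_lock); |
|         return p; |
|     } |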
| |
| +1 for someone using it or planning to use it. |
| - This includes simply having a pointer to the proc, since presumably you will |
| use it. pid2proc() will incref for you. When you are done, decref. |
| - Functions that create a process and return a pointer (like proc_create() or |
| kfs_proc_create()) will also up the refcnt for you. Decref when you're done. |
| - If the *proc is stored somewhere where it will be used again, such as in an IO |
| continuation, it needs to be refcnt'd. Note that if you already had a |
| reference from pid2proc(), simply don't decref after you store the pointer. |
| |
| +1 for current. |
| - current counts as someone using it (expressing interest in the core), but is |
| also a source of the pointer, so it's a bit different. Note that all krefs |
| are sources of a pointer. When we are running on a core that has current |
| loaded, the ref is both for its usage as well as for being the current |
| process. |
| - You have a reference from current and can use it without refcnting, but |
| anything that needs to eat a reference or store/use it needs an incref first. |
| To be clear, your reference is *NOT* edible. It protects the cr3, guarantees |
| the process won't die, and serves as a bootstrappable reference. |
| - Specifically, if you get a ref from current, but then save it somewhere (like |
| an IO continuation request), then clearly you must incref, since it's both |
| current and stored/used. |
| - If you don't know what might be downstream from your function, then incref |
| before passing the reference, and decref when it returns. We used to do this |
| for all syscalls, but now only do it for calls that might not return and |
| expect to receive a reference (like proc_yield()). |
| |
| All functions that take a *proc have a refcnt'd reference, though it may not |
| be edible (it could be current). It is the caller's responsibility to make |
| sure it's edible if necessary. It is the callee's responsibility to incref if |
| it stores or makes a copy of the reference. |
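| |
| For example, a callee that stores the pointer must incref. The struct and |
| helper names in this sketch are hypothetical: |
| |
|     void setup_io_cont(struct proc *p, struct io_cont *cont) |
|     { |
|         proc_incref(p, 1);  /* we're storing p; the caller's ref isn't ours */ |
|         cont->proc = p; |
|     } |
| |
|     void finish_io_cont(struct io_cont *cont) |
|     { |
|         /* ... IO completion work ... */ |
|         proc_decref(cont->proc);    /* drop the ref taken in setup */ |
|     } |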
| |
| 1.5 Functions That Don't or Might Not Return: |
| --------------------------- |
| Refcnting and especially decreffing gets tricky when there are functions that |
| MAY not return. proc_restartcore() does not return (it pops into userspace). |
| proc_run() used to not return, if the core it was called on would pop into |
| userspace (if it was a _S, or if the core is part of the vcoremap for a _M). |
| This doesn't happen anymore, since we have cur_ctx in the per-cpu info. |
| |
| Functions that MAY not return will "eat" your reference *IF* they do not return. |
| This means that you must have a reference when you call them (like always), and |
| that reference will be consumed / decref'd for you if the function doesn't |
| return. Or something similarly appropriate. |
| |
| Arguably, for functions that MAY not return, but will always be called with |
| current's reference (proc_yield()), we could get away without giving it an |
| edible reference, and then never eating the ref. Yield needs to be reworked |
| anyway, so it's not a big deal yet. |
| |
| We do this because when the function does not return, you will not have the |
| chance to decref (your decref code will never run). We need the reference when |
| going in to keep the object alive (like with any other refcnt). We can't have |
| the function always eat the reference, since you cannot simply re-incref the |
| pointer (not allowed to incref unless you know you had a good reference). You'd |
| have to do something like p = pid2proc(p_pid). It's clunky to do that, easy |
| to screw up, and semantically, if the function returns, then we may still have |
| an interest in p and should decref later. |
| |
| The downside is that functions need to determine if they will return or not, |
| which can be a pain (for an out-of-date example: a linear time search when |
| running an _M, for instance, which can suck if we are trying to use a |
| broadcast/logical IPI). |
| |
| As the caller, you usually won't know if the function will return or not, so you |
| need to provide a consumable reference. Current doesn't count. For example, |
| proc_run() requires a reference. You can proc_run(p), and use p afterwards, and |
| later decref. You need to make sure you have a reference, so things like |
| proc_run(pid2proc(55)) work, since pid2proc() increfs for you. But you cannot |
| proc_run(current), unless you incref current in advance. Incidentally, |
| proc_running current doesn't make a lot of sense. |
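| |
| To make that concrete, here's a sketch of both patterns, assuming the old |
| may-not-return semantics described above: |
| |
|     /* OK: pid2proc() gave us an edible reference. */ |
|     struct proc *p = pid2proc(55); |
|     if (p) { |
|         proc_run(p);    /* would eat our ref if it didn't return */ |
|         proc_decref(p); /* it returned, so the ref is still ours to drop */ |
|     } |
| |
|     /* Also OK: make current's reference edible first. */ |
|     proc_incref(current, 1); |
|     proc_run(current);  /* (though proc_running current makes little sense) */ |
|     proc_decref(current); |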
| |
| 1.6 Runnable List: |
| --------------------------- |
| Procs on the runnable list need to have a refcnt (other than the +1 for |
| existing). It's something that cares that the process exists. We could have |
| had it implicitly be refcnt'd (the fact that it's on the list is enough, sort of |
| as if it was part of the +1 for existing), but that complicates things. For |
| instance, it is a source of a reference (for the scheduler) and you could not |
| proc_run() a process from the runnable list without worrying about increfing |
| it beforehand. This isn't true anymore, but the runnable lists are getting |
| overhauled anyway. We'll see what works nicely. |
| |
| 1.7 Internal Details for Specific Functions: |
| --------------------------- |
| proc_run()/__proc_give_cores(): makes sure enough refcnts are in place for all |
| places that will install owning_proc. This also makes it easier on the system |
| (one big incref(n), instead of n increfs of (1) from multiple cores). |
| |
| __set_proc_current() is a helper that makes sure p is the cur_proc. It will |
| incref if installing a new reference to p. If it removed an old proc, it will |
| decref. |
| |
| __proc_startcore(): assumes all references to p are sorted. It will not |
| return, and you should not pass it a reference you need to decref(). Passing |
| it 'owning_proc' works, since you don't want to decref owning_proc. |
| |
| proc_destroy(): it used to not return, and back then if your reference was |
| from 'current', you needed to incref. Now that proc_destroy() returns, it |
| isn't a big deal. Just keep in mind that if you have a function that doesn't |
| return, there's no way for the function to know if its passed reference is |
| edible. Even if p == current, proc_destroy() can't tell if you sent it p (and |
| had a reference) or current and didn't. |
| |
| proc_yield(): when this doesn't return, it eats your reference. It will also |
| decref twice: once when it clears owning_proc, and again when it calls |
| abandon_core() (which clears cur_proc). |
| |
| abandon_core(): it was not given a reference, so it doesn't eat one. It will |
| decref when it unloads the cr3. Note that this is a potential performance |
| issue. When preempting or killing, there are n cores that are fighting for the |
| cacheline to decref. An alternative would be to have one core decref for all n |
| cores, after it knows all cores unloaded the cr3. This would be a good use of |
| the checklist (possibly with one cacheline per core). It would take a large |
| amount of memory for better scalability. |
| |
| 1.8 Things I Could Have Done But Didn't And Why: |
| --------------------------- |
| Q: Could we have the first reference (existence) mean it could be on the runnable |
| list or otherwise in the proc system (but not other subsystems)? In this case, |
| proc_run() won't need to eat a reference at all - it will just incref for every |
| current it will set up. |
| |
| New A: Maybe, now that proc_run() returns. |
| |
| Old A: No: if you pid2proc(), then proc_run() but never return, you have (and |
| lose) an extra reference. We need proc_run() to eat the reference when it |
| does not return. If you decref between pid2proc() and proc_run(), there's a |
| (rare) race where someone else trying to kill the proc drops the refcnt to 0. |
| While proc_run() will check to see if someone else is trying to kill it, there's a |
| slight chance that the struct will be reused and recreated. It'll probably |
| never happen, but it could, and out of principle we shouldn't be referencing |
| memory after it's been deallocated. Avoiding races like this is one of the |
| reasons for our refcnt discipline. |
| |
| Q: (Moot) Could proc_run() always eat your reference, which would make it |
| easier for its implementation? |
| |
| A: Yeah, technically, but it'd be a pain, as mentioned above. You'd need to |
| reacquire a reference via pid2proc(), and it's rather easy to mess up. |
| |
| Q: (Moot) Could we have made proc_destroy() take a flag, saying whether or not |
| it was called on current and needed a decref instead of wasting an incref? |
| |
| A: We could, but won't. This is one case where the external caller is the one |
| that knows whether the function needs to decref or not. But it breaks the |
| convention a bit, doesn't mirror proc_create() as well, and we need to pull |
| in the cacheline with the refcnt anyways. So for now, no. |
| |
| Q: (Moot) Could we make __proc_give_cores() simply not return if an IPI is |
| coming? |
| |
| A: I did this originally, and manually unlocked and __wait_for_ipi()d. Though |
| we'd then need to deal with it like that for all of the related functions, which |
| doesn't work if you wanted to do something afterwards (like schedule(p)). Also |
| these functions are meant to be internal helpers, so returning the bool makes |
| more sense. It eventually led to having __proc_unlock_ipi_pending(), which made |
| proc_destroy() much cleaner and helped with a general model of dealing with |
| these issues. Win-win. |
| |
| 2. When Do We Really Leave "Process Context"? |
| =========================== |
| 2.1 Overview |
| --------------------------- |
| First off, it's not really "process context" in the way Linux deals with it. We |
| aren't operating in kernel mode on behalf of the process (always). We are |
| specifically talking about when a process's cr3 is loaded on a core. Usually, |
| current is also set (the exception for now is when processing ARSCs). |
| |
| There are a couple different ways to do this. One is to never unload a context |
| until something new is being run there (handled solely in __proc_startcore()). |
| Another way is to always explicitly leave the core, like by abandon_core()ing. |
| |
| The issue with the former is that you could have contexts sitting around for a |
| while, and also would have a bit of extra latency when __proc_free()ing during |
| someone *else's* __proc_startcore() (though that could be avoided if it becomes |
| a real issue, via some form of reaping). You'll also probably have excessive |
| decrefs (based on the interactions between proc_run() and __startcore()). |
| |
| The issue with the latter is excessive TLB shootdowns and corner cases. There |
| could be some weird cases (in core_request() for example) where the mgmt core |
| you are running on has the context loaded for proc A, but decides to give the |
| core to proc B. |
| |
| If no process is running there, current == 0 and boot_cr3 is loaded, meaning no |
| process's context is loaded. |
| |
| All changes to cur_proc, owning_proc, and cur_ctx need to be done with |
| interrupts disabled, since they change in interrupt handlers. |
| |
| 2.2 Here's how it is done now: |
| --------------------------- |
| All code is capable of 'spamming' cur_proc (with interrupts disabled!). If it |
| is 0, feel free to set it to whatever process you want. All code that |
| requires current to be set will do so (like __proc_startcore()). The |
| smp_idle() path will make sure current is clear when it halts. So long as you |
| don't change other concurrent code's expectations, you're okay. What I mean |
| by that is you don't clear cur_proc while in an interrupt handler. But if it |
| is already 0, __startcore is allowed to set it to its future proc (which is |
| an optimization). Other code didn't have any expectations of it (it was 0). |
| Likewise, kthread code doesn't have to keep cur_proc set across a sleep_on(). |
| A kthread is somewhat an isolated block (codewise), and leaving current set |
| when it is done is solely to avoid a TLB flush (at the cost of an incref). |
| |
| In general, we try to proactively leave process context, but have the ability |
| to stay in context til __proc_startcore() to handle the corner cases (and to |
| maybe cut down the TLB flushes later). To stop proactively leaving, just |
| change abandon_core() to not do anything with current/cr3. You'll see weird |
| things like processes that won't die until their old cores are reused. The |
| reason we proactively leave context is to help with sanity for these issues, |
| and also to avoid decref's in __startcore(). |
| |
| A couple other details: __startcore() sorts the extra increfs, and |
| __proc_startcore() sorts leaving the old context. Anytime a __startcore kernel |
| message is sent, the sender increfs in advance for the owning_proc refcnt. As |
| an optimization, we can also incref to *attempt* to set current. If current |
| was 0, we set it. If it was already something else, we failed and need to |
| decref. __proc_startcore(), which is the last moment before we *must* have |
| the cr3/current issues sorted, does the actual check of whether there was an |
| old process there or not, while it handles the lcr3 (if necessary). In |
| general, lcr3's |
| ought to have refcnts near them, or else comments explaining why not. |
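| |
| A sketch of that incref-then-attempt pattern (field names illustrative; |
| remember this all happens with interrupts disabled): |
| |
|     proc_incref(p, 1);          /* incref in advance, hoping to install p */ |
|     if (!pcpui->cur_proc) |
|         pcpui->cur_proc = p;    /* was 0: our ref now protects current/cr3 */ |
|     else |
|         proc_decref(p);         /* something was already loaded: undo */ |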
| |
| So we leave process context when told to do so (__death/abandon_core()) or if |
| another process is run there. The _M code is such that a proc will stay on its |
| core until it receives a message, and that message would cleanup/restore a |
| generic context (boot_cr3). A _S could stay on its core until another _S came |
| in. This is much simpler for cases when a timer interrupt goes off to force a |
| schedule() decision. It also avoids a TLB flush in case the scheduler picked |
| that same proc to run again. This could also happen to an _M, if for some |
| reason it was given a management core (!!!) or some other event happened that |
| caused some management/scheduling function to run on one of its cores (perhaps |
| it asked). |
| |
| proc_yield() abandons the core / leaves context. |
| |
| 2.3 Other issues: |
| --------------------------- |
| Note that dealing with interrupting processes that are in the kernel is tricky. |
| There is no true process context, so we can't leave a core until the kernel is |
| in a "safe place", i.e. its state is bundled enough that it can be recontinued |
| later. Calls of this type are routine kernel messages, executed at a convenient |
| time (specifically, before we return to userspace in proc_restartcore()). |
| |
| This same thing applies to __death messages. Even though a process is dying, it |
| doesn't mean we can just drop whatever the kernel was doing on its behalf. For |
| instance, it might be holding a reference that will never get decreffed if its |
| stack gets dropped. |
| |
| 3. Leaving the Kernel Stack: |
| =========================== |
| Just because a message comes in saying to kill a process, it does not mean we |
| should immediately abandon_core(). The problem is more obvious when there is |
| a preempt message, instead of a death message, but either way there is state |
| that needs to be cleaned up (refcnts that need to be downed, etc.). |
| |
| The solution to this is rather simple: don't abandon right away. That was |
| always somewhat the plan for preemption, but was never done for death. And |
| there are several other cases to worry about too. To enforce this, we expand |
| the old "active messages" into a generic work execution message (a kernel |
| message) that can be delayed or shipped to another core. These types of |
| messages will not be executed immediately on the receiving pcore - instead they |
| are on the queue for "when there's nothing else to do in the kernel", which is |
| checked in smp_idle() and before returning to userspace in proc_restartcore(). |
| Additionally, these kernel messages can also be queued on an alarm queue, |
| delaying their activation as part of a generic kernel alarm facility. |
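| |
| In other words, the drain point looks roughly like this sketch (heavily |
| simplified; the real proc_restartcore() and smp_idle() do more): |
| |
|     void proc_restartcore(void) |
|     { |
|         struct per_cpu_info *pcpui = &per_cpu_info[core_id()]; |
| |
|         /* run the delayed work (death, preemption, etc.), in order */ |
|         process_routine_kmsg(); |
|         /* nothing left to do in the kernel: pop back to userspace */ |
|         __proc_startcore(pcpui->owning_proc, pcpui->cur_ctx); |
|     } |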
| |
| One subtlety is that __proc_startcore() shouldn't check for messages, since it |
| is called by __startcore (a message). Checking there would run the messages out |
| of order, which is exactly what we are trying to avoid (total chaos). No one |
| should call __proc_startcore, other than proc_restartcore() or __startcore(). |
| If we ever have functions that do so, if they are not called from a message, |
| they must check for outstanding messages. |
| |
| This last subtlety is why we needed to change proc_run()'s _S case to use a |
| local message instead of calling __proc_startcore() (and why no one else |
| should ever call __proc_startcore()). We could unlock, thereby freeing |
| another core to change the proc state and send a message to us, then try to |
| __proc_startcore(), and then read the message before we had installed current |
| or had a userspace TF to preempt, and probably a few other things. Treating |
| _S as a local message is |
| cleaner, begs to be merged in the code with _M's code, and uses the messaging |
| infrastructure to avoid all the races that it was created to handle. |
| |
| Incidentally, we don't need to worry about missing messages while trying to pop |
| back to userspace from __proc_startcore, since an IPI will be on the way |
| (possibly a self-ipi caused by the __kernel_message() handler). This is also |
| why we needed to make process_routine_kmsg() keep interrupts disabled when it |
| stops (there's a race between checking the queue and disabling ints). |
| |
| 4. Preemption and Notification Issues: |
| =========================== |
| 4.1: Message Ordering and Local Calls: |
| --------------------------- |
| Since we go with the model of cores being told what to do, there are issues |
| with messages being received in the wrong order. That is why we have the |
| kernel messages (guaranteed, in-order delivery), with the proc-lock protecting |
| the send order. However, this is not enough for some rare races. |
| |
| Local calls can also perform the same tasks as messages (calling |
| proc_destroy() while a death IPI is on its way). We refer to these calls as |
| messing with "local fate" (compared to global state - we're clever; |
| preempting a single vcore doesn't change the process's state). These calls |
| are a little different, because they also involve a check to see if it should |
| perform the function or other action (e.g., death just idling and waiting for |
| an IPI instead of trying to kill itself), instead of just blindly doing |
| something. |
| |
| 4.1.1: Possible Solutions |
| ---------------- |
| There are two ways to deal with this. One (and the better one, I think) is to |
| check state, and determine if it should proceed or abort. This requires that |
| all local-fate dependent calls always have enough state to do their job. In |
| the past, this meant that any function that results in sending a directive to |
| a vcore stores enough info in the proc struct that a local call can determine |
| if it should take action or abort. Back then, we used the vcore/pcoremap as a |
| way to send info to the receiver about what vcore they are (or should be). |
| Now, we store that info in pcpui (for '__startcore', we send it as a |
| parameter). Either way, the general idea is still true: local calls can |
| proceed when they are called, and not self-ipi'd to a nebulous later time. |
| |
| The other way is to send the work (including the checks) in a self-ipi kernel |
| message. This will guarantee that the message is executed after any existing |
| messages (making the k_msg queue the authority for what should happen to a |
| core). The check is also performed later (when the k_msg executes). There |
| are a couple issues with this: if we allow the local core to send itself a |
| k_msg that could be out of order (meaning it should not be sent, and is only |
| sent due to ignorance of its sealed fate), AND if we return the core to the |
| idle-core-list once its fate is sealed, we need to detect that the message is |
| for the wrong process and that the process is in the wrong state. To do this, |
| we probably need local versioning on the pcore so it can detect that the |
| message is late/wrong. We might get by with just the proc* (though that is |
| tricky with death and proc reuse), so long as we don't allow new startcores |
| for a proc until AFTER the preemption is completed. |
| |
| 4.2: Preempt-Served Flag |
| ---------------- |
| We want to be able to consider a pcore free once its owning proc has dealt |
| with removing it. This allows a scheduler-like function to easily take a core |
| and then give it to someone else, without waiting for each vcore to respond, |
| saying that the pcore is free/idle. |
| |
| We used to not unmap until we were in '__preempt' or '__death', and we needed |
| a flag to tell yield-like calls that a message was already on the way and to |
| not rely on the vcoremap. This is pretty fucked up for a number of reasons, |
| so we changed that. But we still wanted to know when a preempt was in |
| progress so that the kernel could avoid giving out the vcore until the preempt |
| was complete. |
| |
| Here's the scenario: we send a '__startcore' to core 3 for VC5->PC3. Then we |
| quickly send a '__preempt' to 3, and then a '__startcore' to core 4 (a |
| different pcore) for VC5->PC4. Imagine all of this happens before the first |
| '__startcore' gets processed (IRQ delay, fast ksched, whatever). We need to |
| not run the second '__startcore' on pcore 4 before the preemption has saved |
| all of the state of the VC5. So we spin on preempt_served (which may get |
| renamed to preempt_in_progress). We need to do this in the sender, and not |
| the receiver (not in the kmsg), because the kmsgs can't tell which one they |
| are. Specifically, the first '__startcore' on core 3 runs the same code as |
| the '__startcore' on core 4, working on the same vcore. Anything we tell VC5 |
| will be seen by both PC3 and PC4. We'd end up deadlocking on PC3 while it |
| spins waiting for the preempt message that also needs to run on PC3. |
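| |
| So the send side looks something like this sketch (the argument layout and |
| names are illustrative): |
| |
|     /* don't start VC5 anywhere new til the old '__preempt' has finished */ |
|     while (vc->preempt_served) |
|         cpu_relax(); |
|     send_kernel_message(pcoreid, __startcore, (long)p, (long)vcoreid, 0, |
|                         KMSG_ROUTINE); |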
| |
| The preempt_pending flag is actually a timestamp: the expiration time of the |
| core, at which point the message will be sent. We could try to use that, but |
| since alarms aren't fired at exactly the time they are scheduled, the message |
| might not actually be sent yet (though it will, really soon). Still, we'll |
| just go with the preempt-served flag for now. |
| |
| 4.3: Impending Notifications |
| ---------------- |
| It's also possible that there is an impending notification. There's no change |
| in fate (though there could be a fate-changing preempt on its way), just the |
| user wants a notification handler to run. We need a flag anyways for this |
| (discussed below), so proc_yield() or whatever other local call we have can |
| check this flag as well. |
| |
| Though for proc_yield(), it doesn't care if a notification is on its way (can |
| be dependent on a flag to yield from userspace, based on the nature of the |
| yield (which still needs to be sorted)). If the yield is in response to a |
| preempt_pending, it actually should yield and not receive the notification. |
| So it should destroy its vcoreid->pcoreid mapping and abandon_core(). When |
| that notification hits, it will be for a proc that isn't current, and will be |
| ignored (it will get run the next time that vcore fires up, handled below). |
| |
| There is a slight chance that the same proc will run on that pcore, but with a |
| different vcoreid. In the off chance this happens, the new vcore will get a |
| spurious notification. Userspace needs to be able to handle spurious |
| notifications anyways, (there are a couple other cases, and in general it's |
| not hard to do), so this is not a problem. Instead of trying to have the |
| kernel ignore the notification, we just send a spurious one. A crappy |
| alternative would be to send the vcoreid with the notification, but that would |
| mean we can't send a generic message (broadcast) to a bunch of cores, which |
| will probably be a problem later. |
| |
| Note that this specific case is because the "local work message" gets |
| processed out of order with respect to the notification. And we want this in |
| that case, since that proc_yield() is more important than the notification. |
| |
| 4.4: Preemption / Allocation Phases and Alarm Delays |
| --------------------------- |
| A per-vcore preemption phase starts when the kernel marks the core's |
| preempt_pending flag/counter and can include the time when an alarm is |
| waiting to go off to reclaim the core. The phase ends when the vcore's pcore |
| is reclaimed, either as a result of the kernel taking control, or because a |
| process voluntarily yielded. |
| |
| Specifically, the preempt_pending variable is actually a timestamp for when |
| the core will be revoked (this assumes some form of global time, which we need |
| anyways). If its value is 0, then there is no preempt-pending, it is not in a |
| phase, and the vcore can be given out again. |
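| |
| That gives simple tests, sketched here with made-up helper names: |
| |
|     /* 0 means no preempt-phase; otherwise it's the revocation deadline */ |
|     static bool vcore_in_preempt_phase(struct vcore *vc) |
|     { |
|         return vc->preempt_pending != 0; |
|     } |
| |
|     static bool vcore_preempt_overdue(struct vcore *vc, uint64_t now) |
|     { |
|         return vc->preempt_pending && (now >= vc->preempt_pending); |
|     } |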
| |
| When a preempt alarm goes off, the alarm only means to check a process for |
| expired vcores. If the vcore has been yielded while the alarm was pending, |
| the preempt_pending flag will be reset to 0. To speed up the search for |
| vcores to preempt, there's a circular buffer corelist in the proc struct, with |
| vcoreids of potential suspects. Or at least this will exist at some point. |
| Also note that the preemption list isn't bound to a specific alarm: you can |
| check the list at any time (not necessarily on a specific alarm), and you can |
| have spurious alarms (the list is empty, so it'll be a noop). |
| |
| Likewise, a global preemption phase is when an entire MCP is getting |
| gang_preempted, and the global deadline is set. A function can quickly check |
| to see if the process responded, since the list of vcores with preemptions |
| pending will be empty. |
| |
| It seems obvious, but we do not allow allocation of a vcore during its |
| preemption phase. The main reason is that it can potentially break |
| assumptions about the vcore->pcore mapping and can result in multiple |
| instances of the same vcore on different pcores. Imagine a preempt message |
| sent to a pcore (after the alarm goes off), meanwhile that vcore/pcore yields |
| and the vcore reactivates somewhere else. There is a potential race on the |
| vcore_ctx state: the new vcore is reading while the old is writing. This |
| issue is sorted naturally: the vcore entry in the vcoremap isn't cleared until |
| the vcore/pcore is actually yielded/taken away, so the code looking for a free |
| vcoreid slot will not try to use it. |
| |
| Note that if we didn't design the alarm system to simply check for |
| preemptions (perhaps it has a stored list of vcores to preempt), then we |
| couldn't end the preempt-phase until the alarm was sorted. If that is the |
| case, we could easily give out a vcore that had been yielded but was still in |
| a preempt-phase. Stopping an alarm would be tricky too, since there could be |
| lots of vcores in different states that need to be sorted by the alarm (so |
| ripping it out isn't enough). Setting a flag might not be enough either. |
| Vcore version numbers/history (as well as global proc histories) are a pain |
| I'd like to avoid too. So don't change the alarm / delayed preemption system |
| without thinking about this. |
| |
| Also, allowing a vcore to restart while preemptions are pending also mucks |
| with keeping the vcore mapping "old" (while the message is in flight). A |
| pcore will want to use that to determine which vcore is running on it. It |
| would be possible to keep a pcoremap for the reverse mapping out of sync, but |
| that seems like a bad idea. In general, having the pcoremap is a good idea |
| (whenever we talk about a vcoremap, we're usually talking about both |
| directions: "the vcore->pcore mapping"). |
| |
| 4.5: Global Preemption Flags |
| --------------------------- |
| If we are trying to preempt an entire process at the same time, instead of |
| playing with the circular buffer of vcores pending preemption, we could have a |
| global timer as well. This avoids some O(n) operations, though it means that |
| userspace needs to check two "flags" (expiration dates) when grabbing its |
| preempt-critical locks. |
| |
| 4.6: Notifications Mixed with Preemption and Sleeping |
| --------------------------- |
| It is possible that notifications will mix with preemptions or come while a |
| process is not running. Ultimately, the process wants to be notified on a |
| given vcore. Whenever we send an active notification, we set a flag in procdata |
| (notif_pending). If the vcore is offline, we don't bother sending the IPI/notif |
| message. The kernel will make sure it runs the notification handler (as well as |
| restoring the vcore_ctx) the next time that vcore is restarted. Note that |
| userspace can toggle this, so it can handle the notifications from a different |
| core if it likes, or it can independently send a notification. |
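| |
| A sketch of the send side of that policy (the vcpd fields and helpers are |
| from memory and may differ from the source): |
| |
|     struct preempt_data *vcpd = &p->procdata->vcore_preempt_data[vcoreid]; |
| |
|     vcpd->notif_pending = TRUE; /* set first: a later restart will see it */ |
|     if (vcore_is_mapped(p, vcoreid) && !vcpd->notif_disabled) |
|         send_kernel_message(get_pcoreid(p, vcoreid), __notify, (long)p, 0, 0, |
|                             KMSG_ROUTINE); |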
| |
| Note we use notif_pending to detect if an IPI was missed while notifs were |
| disabled (this is done in pop_user_ctx() by userspace). The overall meaning |
| of notif_pending is that a vcore wants to be IPI'd. The IPI could be |
| in-flight, or it could be missed. Since notification IPIs can be spurious, |
| when we have potential races, we err on the side of sending. This happens |
| when pop_user_ctx() notifies itself, and when the kernel makes sure to start a |
| vcore in vcore context if a notif was pending. This was simplified a bit over |
| the years by having uthreads always get saved into the uthread_ctx (formerly |
| the notif_tf), instead of in the old preempt_tf (which is now the vcore_ctx). |
| |
| If a vcore has a preempt_pending, we will still send the active notification |
| (IPI). The core ought to get a notification for the preemption anyway, so we |
| need to be able to send one. Additionally, once the vcore is handling that |
| preemption notification, it will have notifs disabled, which will prevent us |
| from sending any extra notifications anyways. |
| |
| 4.7: Notifs While a Preempt Message is Served |
| --------------------------- |
| It is possible to have the kernel handling a notification k_msg and to have a |
| preempt k_msg in the queue (preempt-served flag is set). Ultimately, what we |
| want is for the core to be preempted and the notification handler to run on |
| the next execution. Both messages are in the k_msg queue for "a convenient |
| time to leave the kernel" (I'll have a better name for that later). What we |
| do is execute the notification handler and jump to userspace. Since there is |
| still a k_msg in the queue (and we self-IPI'd ourselves; it's part of how |
| k_msgs work), the IPI will fire and push us right back into the kernel to |
| execute the preemption, and the notif handler's context will be saved in the |
| vcore_ctx (ready to go when the vcore gets started again). |
| |
| We could try to just leave the notif_pending flag set and ignore the message, |
| but that would involve inspecting the queue for the preempt k_msg. |
| Additionally, a preempt k_msg can arrive anyway. Finally, it's possible to have |
| another message in the queue between the notif and the preempt, and it gets ugly |
| quickly trying to determine what to do. |
| |
| 4.8: When a Pcore is "Free" |
| --------------------------- |
| There are a couple ways to handle pcores. One approach would be to not |
| consider them free and able to be given to another process until the old |
| process is completely removed (abandon_core()). Another approach is to free |
| the core once its fate is sealed (which we do). This probably gives more |
| flexibility in schedule()-like functions (no need to wait to give the core |
| out), quicker dispatch latencies, less contention on shared structs (like the |
| idle-core-map), etc. |
| |
| This 'freeing' of the pcore is from the perspective of the kernel scheduler |
| and the proc struct. Contrary to all previous announcements, vcores are |
| unmapped from pcores when sending k_msgs (technically right after), while |
| holding the lock. The pcore isn't actually not-running-the-proc until the |
| kmsg completes and we abandon_core(). Previously, we used the vcoremap to |
| communicate to other cores in a lock-free manner, but that was pretty shitty |
| and now we just store the vcoreid in pcpu info. |
| |
| Another tricky part is the seq_ctr used to signal userspace of changes to the |
| coremap or num_vcores (coremap_seqctr). While we may not even need this in the |
| long run, it still seems like it could be useful. The trickiness comes from |
| when we update the seq_ctr when we are unmapping vcores on the receive side of a |
| message (like __death or __preempt). We'd rather not have each pcore contend on |
| the seq_ctr cache line (let alone any locking) while they perform a somewhat |
| data-parallel task. So we continue to have the sending core handle the seq_ctr |
| upping and downing. This works, since the "unlocking" happens after messages |
| are sent, which means the receiving core is no longer in userspace (if there is |
| a delay, it is because the remote core is in the kernel, possibly with |
| interrupts disabled). Because of this, userspace will be unable to read the new |
| value of the seq_ctr before the IPI hits and does the unmapping that the seq_ctr |
| protects/advertises. This is most likely true. It wouldn't be if the "last IPI |
| was sent" flag clears before the IPI actually hit the other core. |
| |
| 4.9: Future Broadcast/Messaging Needs |
| --------------------------- |
| Currently, messaging is serialized. Broadcast IPIs exist, but the kernel |
| message system is based on adding a k_msg to a list in a pcore's |
| per_cpu_info. Further, the sending of these messages is in a loop. In the |
| future, we would like to have broadcast messaging of some sort (literally a |
| broadcast, like the IPIs, and if not that, then a communication tree of |
| sorts). |
| |
| In the past (OLD INFO), given those desires, we wanted to make sure that no |
| message we send needs details specific to a pcore (such as the vcoreid running |
| on it, a history number, or anything like that). Thus no k_msg related to |
| process management would have anything that cannot apply to the entire |
| process. At this point, most just have a struct proc *. A pcore was able to |
| figure out what was happening based on the pcoremap, information in the |
| struct proc, and in the preempt struct in procdata. |
| |
| In more recent revisions, the coremap no longer is meant to be used across |
| kmsgs, so some messages ('__startcore') send the vcoreid. This means we can't |
| easily broadcast the message. However, many broadcast mechanisms wouldn't |
| handle '__startcore' naturally. For instance, logical IPIs need something |
| already set in the LAPIC, or maybe need to be sent to a somewhat predetermined |
| group (again, bits in the LAPIC). If we tried this for '__startcore', we |
| could add something in to the messaging to carry these vcoreids. More likely, |
| we'll have a broadcast tree. Keeping vcoreid (or any arg) next to whoever |
| needs to receive the message is a very small amount of bookkeeping on a struct |
| that already does bookkeeping. |
| |
| 4.10: Other Things We Thought of but Don't Like |
| --------------------------- |
| All local fate-related work is sent as a self k_msg, to enforce ordering. |
| It doesn't capture the difference between a local call and a remote k_msg. |
| The k_msg has already considered state and made its decision. The local call |
| is an attempt. It is also unnecessary, if we put in enough information to |
| make a decision in the proc struct. Finally, it caused a few other problems |
| (like needing to detect arbitrary stale messages). |
| |
| Overall message history: doesn't work well when you do per-core stuff, since |
| it will invalidate other messages for the process. We then thought of a pcore |
| history counter to detect stale messages. Don't like that either. We'd have |
| to send the history in the message, since it's a per-message, per-core |
| expiration. There might be other ways around this, but this doesn't seem |
| necessary. |
| |
| Alarms have pointers to a list of which cores should be preempted when that |
| specific alarm goes off (saved with the alarm). Ugh. It gets ugly with |
| multiple outstanding preemptions and cores getting yielded while the alarms |
| sleep (and possibly could get reallocated, though we'd make a rule to prevent |
| that). Like with notifications, being able to handle spurious alarms and |
| thinking of an alarm as just a prod to check somewhere is much more flexible |
| and simple. It is similar to generic messages that have the actual important |
| information stored somewhere else (as with allowing broadcasts, with different |
| receivers performing slightly different operations). |
| |
| Synchrony for messages (wanting a response to a preempt k_msg, for example) |
| sucks. Just encode the state of impending fate in the proc struct, where it |
| belongs. Additionally, we don't want to hold the proc lock even longer than |
| we do now (which is probably too long as it is). Finally, it breaks a golden |
| rule: never wait while holding a lock: you will deadlock the system (e.g. if |
| the receiver is already in the kernel spinning on the lock). We'd have to |
| send messages, unlock (which might cause a message to hit the calling pcore, |
| as in the case of locally called proc_destroy()), and in the meantime some |
| useful invariant might be broken. |
| |
| We also considered using the transition stack as a signal that a process is in |
| a notification handler. The kernel can inspect the stack pointer to determine |
| this. It's possible, but unnecessary. |
| |
| Using the pcoremap as a way to pass info with kmsgs: it worked a little, but |
| had some serious problems, as well as making life difficult. It had two |
| purposes: help with local fate calls (yield) and allow broadcast messaging. |
| The main issue is that it was using a global struct to pass info with |
| messages, but it was based on the snapshot of state at the time the message |
| was sent. When you send a bunch of messages, certain state may have changed |
| between messages, and the old snapshot isn't there anymore by the time the |
| message gets there. To avoid this, we went through some hoops and had some |
| fragile code that would use other signals to avoid those scenarios where the |
| global state change would send the wrong message. It was tough to understand, |
| and not clear it was correct (hint: it wasn't). Here's an example (on one |
| pcore): if we send a preempt and we then try to map that pcore to another |
| vcore in the same process before the preempt call checks its pcoremap, we'll |
| clobber the pcore->vcore mapping (used by startcore) and the preempt will |
| think it is the new vcore, not the one it was when the message was sent. |
| While this is a bit convoluted, I can imagine a ksched doing this, and |
| perhaps with weird IRQ delays, the messages might get delayed enough for it to |
| happen. I'd rather not have to have the ksched worry about this just because |
| proc code was old and ghetto. Another reason we changed all of this was so |
| that you could trust the vcoremap while holding the lock. Otherwise, it's |
| actually non-trivial to know the state of a vcore (need to check a combination |
| of preempt_served and is_mapped), and even if you do that, there are some |
| complications with doing this in the ksched. |
| |
| 5. current_ctx and owning_proc |
| =========================== |
| Originally, current_tf was a per-core macro that returned a struct trapframe * |
| pointing back on the kernel stack to the user context that was running on the |
| given core when an interrupt or trap happened. Saving the reference to |
| the TF helps simplify code that needs to do something with the TF (like save |
| it and pop another TF). This way, we don't need to pass the context all over |
| the place, especially through code that might not care. |
| |
| Then, current_tf was more broadly defined as the user context that should be |
| run when the kernel is ready to run a process. In the older case, it was when |
| the kernel tries to return to userspace from a trap/interrupt. current_tf |
| could be set by an IPI/KMSG (like '__startcore') so that when the kernel wants |
| to idle, it will find a current_tf that it needs to run, even though we never |
| trapped in on that context in the first place. |
| |
| Finally, current_tf was changed to current_ctx, and instead of tracking a |
| struct trapframe (equivalent to a hw_trapframe), it now tracked a struct |
| user_context, which could be either a HW or a SW trapframe. |
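| |
| The shape of that type is roughly the following (simplified from the |
| headers): |
| |
|     struct user_context { |
|         int type;                       /* ROS_HW_CTX or ROS_SW_CTX */ |
|         union { |
|             struct hw_trapframe hw_tf;  /* trap/interrupt entry */ |
|             struct sw_trapframe sw_tf;  /* voluntary (syscall) entry */ |
|         } tf; |
|     }; |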
| |
| Further, we now have 'owning_proc', which tells the kernel which process |
| should be run. 'owning_proc' is a bigger deal than 'current_ctx', and it is |
| what tells us to run cur_ctx. |
| |
| Process management KMSGs now simply modify 'owning_proc' and cur_ctx, as if we |
| had interrupted a process. Instead of '__startcore' forcing the kernel to |
| actually run the process and trapframe, it will just mean we will eventually |
| run it. In the meantime a '__notify' or a '__preempt' can come in, and they |
| will apply to the owning_proc/cur_ctx. This greatly simplifies process code |
| and code calling process code (like the scheduler), since we no longer need to |
| worry about whether or not we are getting a "stack killing" kernel message. |
| Before this, code needed to care where it was running when managing _Ms. |
| |
| Note that neither 'current_ctx' nor 'owning_proc' rely on 'current'/'cur_proc'. |
| 'current' is just what process context we're in, not what process (and which |
| trapframe) we will eventually run. |
| |
| cur_ctx does not point to kernel trapframes, which is important when we |
| receive an interrupt in the kernel. At one point, we were (hypothetically) |
| clobbering the reference to the user trapframe, and were unable to recover. |
| We can get away with this because the kernel always returns to its previous |
| context from a nested handler (via iret on x86). |
| |
| In the future, we may need to save kernel contexts and may not always return |
| via iret. At which point, if the code path is deep enough that we don't want |
| to carry the TF pointer, we may revisit this. Until then, current_ctx is just |
| for userspace contexts, and is simply stored in per_cpu_info. |
| |
| Brief note from the future (months after this paragraph was written): cur_ctx |
| has two aspects/jobs: |
| 1) tell the kernel what we should do (trap, fault, sysc, etc), how we came |
| into the kernel (the fact that it is a user tf), which is why we copy-out |
| early on |
| 2) be a vehicle for us to restart the process/vcore |
| |
| We've been focusing on the latter case a lot, since that is what gets |
| removed when preempted, changed during a notify, created during a startcore, |
| etc. Don't forget it was also an instruction of sorts. The former case is |
| always true throughout the life of the syscall. The latter only happens to be |
| true throughout the life of a *non-blocking* trap since preempts are routine |
| KMSGs. But if we block in a syscall, the cur_ctx is no longer the TF we came |
| in on (and possibly the one we are asked to operate on), and that old cur_ctx |
| has probably restarted. |
| |
| (Note that cur_ctx is a pointer, and syscalls/traps actually operate on the TF |
| they came in on regardless of what happens to cur_ctx or pcpui->actual_tf.) |
| |
| 6. Locking! |
| =========================== |
| 6.1: proc_lock |
| --------------------------- |
| Currently, all locking is done on the proc_lock. Its main goal is to protect |
| the vcore mapping (vcore->pcore and vice versa). As of Apr 2010, it's also used |
| to protect changes to the address space and the refcnt. Eventually the refcnt |
| will be handled with atomics, and the address space will have its own MM lock. |
| |
| We grab the proc_lock all over the place, but we try to avoid it wherever |
| possible - especially in kernel messages or other places that will be executed |
| in parallel. One place we do grab it but would like to not is in proc_yield(). |
| We don't always need to grab the proc lock. Here are some examples: |
| |
| 6.1.1: Lockless Notifications: |
| ------------- |
| We don't lock when sending a notification. We want the proc_lock to not be an |
| irqsave lock (discussed below). Since we might want to send a notification from |
| interrupt context, we can't grab the proc_lock if it's a regular lock. |
| |
| This is okay, since the proc_lock is only protecting the vcoremapping. We could |
| accidentally send the notification to the wrong pcore. The __notif handler |
| checks to make sure it is the right process, and all _M processes should be able |
| to handle spurious notifications. This assumes they are still _M. |
| |
| If we send it to the wrong pcore, there is a danger of losing the notif, since |
| it didn't go to the correct vcore. That would happen anyway (the vcore is |
| unmapped, or in the process of mapping). The notif_pending flag will be caught |
| when the vcore is started up next time (and that flag was set before reading the |
| vcoremap). |
| |
| 6.1.2: Local get_vcoreid(): |
| ------------- |
| It's not necessary to lock while checking the vcoremap if you are checking for |
| the core you are running on (e.g. pcoreid == core_id()). This is because all |
| unmappings of a vcore are done on the receive side of a routine kmsg, and that |
| code cannot run concurrently with the code you are running. |
| |
| 6.2: irqsave |
| --------------------------- |
| The proc_lock used to be an irqsave lock (meaning it disables interrupts and can |
| be grabbed from interrupt context). We made it a regular lock for a couple |
| reasons. The immediate one was it was causing deadlocks due to some other |
| ghetto things (blocking on the frontend server, for instance). More generally, |
| we don't want to disable interrupts for long periods of time, so it was |
| something worth doing anyway. |
| |
| This means that we cannot grab the proc_lock from interrupt context. This |
| includes having schedule called from an interrupt handler (like the |
| timer_interrupt() handler), since it will call proc_run. Right now, we actually |
| do this, which we shouldn't, and that will eventually get fixed. The right |
| answer is that the actual work of running the scheduler should be a routine |
| kmsg, similar to how Linux sets a bit in the kernel that it checks on the way |
| out to see if it should run the scheduler or not. |
| |
| 7. TLB Coherency |
| =========================== |
| When changing or removing memory mappings, we need to do some form of a TLB |
| shootdown. Normally, this will require sending an IPI (immediate kmsg) to |
| every vcore of a process to unmap the affected page. Before allocating that |
| page back out, we need to make sure that every TLB has been flushed. |
| |
| One reason to use a kmsg over a simple handler is that we often want to pass a |
| virtual address to flush for those architectures (like x86) that can |
| invalidate a specific page. Ideally, we'd use a broadcast kmsg (doesn't exist |
| yet), though we already have simple broadcast IPIs. |
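| |
| A sketch of such a handler, taking a range (the args follow the generic kmsg |
| handler signature; the helper names are illustrative): |
| |
|     static void __tlb_shootdown(uint32_t srcid, long va_start, long va_end, |
|                                 long arg2) |
|     { |
|         for (uintptr_t va = va_start; va < (uintptr_t)va_end; va += PGSIZE) |
|             invlpg((void*)va);  /* x86: invalidate a single page mapping */ |
|     } |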
| |
| 7.1 Initial Stuff |
| --------------------------- |
| One big issue is whether or not to wait for a response from the other vcores |
| that they have unmapped. There are two concerns: 1) Page reuse and 2) User |
| semantics. We cannot give out the physical page while it may still be in a |
| TLB (even to the same process - ask us about the pthread_test bug). |
| |
| The second case is a little more detailed. The application may not like it if |
| it thinks a page is unmapped or protected, and it does not generate a fault. |
| I am less concerned about this, especially since we know that even if we don't |
| wait to hear from every vcore, we know that the message was delivered and the |
| IPI sent. Any cores that are in userspace will have trapped and eventually |
| handle the shootdown before having a chance to execute other user code. The |
| delays in the shootdown response are due to being in the kernel with |
| interrupts disabled (it was an IMMEDIATE kmsg). |
| |
| 7.2 RCU |
| --------------------------- |
| One approach is similar to RCU. Unmap the page, but don't put it on the free |
| list. Instead, don't reallocate it until we are sure every core (possibly |
| just affected cores) had a chance to run its kmsg handlers. This time is |
| similar to the RCU grace periods. Once the period is over, we can then truly |
| free the page. |
| |
| This would require some sort of RCU-like mechanism and probably a per-core |
| variable that has the timestamp of the last quiescent period. Code caring |
| about when this page (or pages) can be freed would have to check on all of the |
| cores (probably in a bitmask for what needs to be freed). It would make sense |
| to amortize this over several RCU-like operations. |
| |
| 7.3 Checklist |
| --------------------------- |
| It might not suck that much to wait for a response if you already sent an IPI, |
| though it incurs some more cache misses. If you wanted to ensure all vcores |
| ran the shootdown handler, you'd have them all toggle their bit in a checklist |
| (unused for a while, check smp.c). The only one who waits would be the |
| caller, but there still are a bunch of cache misses in the handlers. Maybe |
| this isn't that big of a deal, and the RCU thing is an unnecessary |
| optimization. |
| |
| 7.4 Just Wait til a Context Switch |
| --------------------------- |
| Another option is to not bother freeing the page until the entire process is |
| descheduled. This could be a very long time, and also will mess with |
| userspace's semantics. They would be running user code that could still |
| access the old page, so in essence this is a lazy munmap/mprotect. The |
| process basically has the page in purgatory: it can't be reallocated, and it |
| might be accessible, but can't be guaranteed to work. |
| |
| The main benefit of this is that you don't need to send the TLB shootdown IPI |
| at all - so you don't interfere with the app. Though in return, they have |
| possibly weird semantics. One aspect of these weird semantics is that the |
| same virtual address could map to two different pages - that seems like a |
| disaster waiting to happen. We could also block that range of the virtual |
| address space from being reallocated, but that gets even more tricky. |
| |
| One issue with just waiting and RCU is memory pressure. If we actually need |
| the page, we will need to enforce an unmapping, which sucks a little. |
| |
| 7.5 Bulk vs Single |
| --------------------------- |
| If there are a lot of pages being shot down, it'd be best to amortize the cost |
| of the kernel messages, as well as the invlpg calls (single page shootdowns). |
| One option would be for the kmsg to take a range, and not just a single |
| address. This would help with bulk munmap/mprotects. Based on the number of |
| these, perhaps a raw tlbflush (the entire TLB) would be worthwhile, instead |
| of n single shots. Odds are, that number is arch and possibly workload |
| specific. |
| |
| For now, the plan will be to send a range and have them individually shot |
| down. |
| |
| 7.6 Don't do it |
| --------------------------- |
| Either way, munmap/mprotect sucks in an MCP. I recommend not doing it, and |
| doing the appropriate mmap/munmap/mprotects in _S mode. Unfortunately, even |
| our crap pthread library munmaps on demand as threads are created and |
| destroyed. The vcore code probably does in the bowels of glibc's TLS code |
| too, though at least that isn't on every user context switch. |
| |
| 7.7 Local memory |
| --------------------------- |
| Private local memory would help with this too. If each vcore has its own |
| range, we won't need to send TLB shootdowns for those areas, and we won't have |
| to worry about weird application semantics. The downside is we would need to |
| do these mmaps in certain ranges in advance, and might not easily be able to |
| do them remotely. More on this when we actually design and build it. |
| |
| 7.8 Future Hardware Support |
| --------------------------- |
| It would be cool and interesting if we had the ability to remotely shootdown |
| TLBs. For instance, all cores with cr3 == X, shootdown range Y..Z. It's |
| basically what we'll do with the kernel message and the vcoremap, but with |
| magic hardware. |
| |
| 7.9 Current Status |
| --------------------------- |
| For now, we just send a kernel message to all vcores to do a full TLB flush, |
| and not to worry about checklists, waiting, or anything. This is due to |
| being short on time and not wanting to sort out the issue with ranges. The |
| plan is to change it to send the kmsg with the range to the appropriate |
| cores, and then maybe put the page on the end of the freelist (instead of |
| the head). More to come. |
| |
| 8. Process Management |
| =========================== |
| 8.1 Vcore lists |
| --------------------------- |
| We have three lists to track a process's vcores. The vcores themselves sit |
| in the vcoremap in procinfo; they aren't dynamically allocated (memory) or |
| anything like that. The lists greatly ease vcore discovery and management. |
| |
| A vcore is on exactly one of three lists: online (mapped and running vcores, |
| sometimes called 'active'), bulk_preempt (was online when the process was bulk |
| preempted (like a timeslice)), and inactive (yielded, hasn't come on yet, |
| etc). When writes are complete (unlocked), either the online list or the |
| bulk_preempt list should be empty. |
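| |
| A sketch of the structures involved (sys/queue.h TAILQs; the field names |
| are my guesses, and the vcores themselves still live in the vcoremap): |
| |
|     struct vcore { |
|         /* the real vcore fields, in the procinfo vcoremap */ |
|         TAILQ_ENTRY(vcore) list;        /* linkage for whichever list */ |
|     }; |
|     TAILQ_HEAD(vcore_tailq, vcore); |
| |
|     struct proc { |
|         /* ... */ |
|         spinlock_t proc_lock;           /* protects list writes */ |
|         struct vcore_tailq online_vcs;          /* mapped and running */ |
|         struct vcore_tailq bulk_preempted_vcs;  /* online at bulk preempt */ |
|         struct vcore_tailq inactive_vcs;        /* yielded, not yet run */ |
|     }; |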
| |
| List modifications are protected by the proc_lock. You can concurrently read, |
| but note you may get some weird behavior, such as a vcore on multiple lists, a |
| vcore on no lists, online and bulk_preempt both having items, etc. Currently, |
| event code will read these lists when hunting for a suitable core, and will |
| have to be careful about races. I want to keep event FALLBACK code from |
| grabbing the proc_lock. |
| |
| Another slight thing to be careful of is that the vcore lists don't always |
| agree with the vcore mapping. However, they will always agree with what the |
| state of the process will be when all kmsgs are processed (fate). |
| Specifically, when we take vcores, the unmapping happens with the lock not |
| held on the vcore itself (as discussed elsewhere). The vcore lists represent |
| the result of those pending unmaps. |
| |
| Before we used the lists, we scanned the vcoremap in a painful, clunky manner. |
| In the old style, when you asked for a vcore, the first one you got was the |
| first hole in the vcoremap. Ex: Vcore0 would always be granted if it was |
| offline. That's no longer true; the most recently yielded vcore will be |
| given out next. This will help with cache locality, and also cuts down on |
| the scenarios in which the kernel gives out a vcore that userspace wasn't |
| expecting. This can still happen if they ask for more vcores than they set |
| up for, or if a vcore doesn't *want* to come online (there are a couple of |
| scenarios with preemption recovery where that may come up). |
| |
| So the plan with the bulk preempt list is that vcores on it were preempted, |
| and the kernel will attempt to restart all of them (and move them to the online |
| list). Any leftovers will be moved to the inactive list, and have preemption |
| recovery messages sent out. Any shortages (they want more vcores than were |
| bulk_preempted) will be taken from the inactive (yielded) list. This all |
| means that |
| whether or not a vcore needs to be preempt-recovered or if there is a message |
| out about its preemption doesn't really affect which list it is on. You could |
| have a vcore on the inactive list that was bulk preempted (and not turned back |
| on), and then that vcore gets granted in the next round of vcore_requests(). |
| The preemption recovery handlers will need to deal with concurrent handlers |
| and the vcore itself starting back up. |
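| |
| Here's a hypothetical sketch of that plan. The helpers are made up, and |
| this would run with the proc_lock held: |
| |
|     void restart_bulk_preempted(struct proc *p, int nr_granted) |
|     { |
|         struct vcore *vc; |
| |
|         /* Restart as many bulk-preempted vcores as we have cores. The |
|          * assumed helper moves vc to the online list and sends the |
|          * __startcore. */ |
|         while (nr_granted && (vc = TAILQ_FIRST(&p->bulk_preempted_vcs))) { |
|             move_to_online_and_startcore(p, vc); |
|             nr_granted--; |
|         } |
|         /* Shortage: they wanted more than were bulk preempted */ |
|         while (nr_granted && (vc = TAILQ_FIRST(&p->inactive_vcs))) { |
|             move_to_online_and_startcore(p, vc); |
|             nr_granted--; |
|         } |
|         /* Leftovers: move to inactive, send preempt recovery events */ |
|         while ((vc = TAILQ_FIRST(&p->bulk_preempted_vcs))) { |
|             TAILQ_REMOVE(&p->bulk_preempted_vcs, vc, list); |
|             TAILQ_INSERT_HEAD(&p->inactive_vcs, vc, list); |
|             send_preempt_recovery_event(p, vc);     /* assumed helper */ |
|         } |
|     } |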
| |
| 9. On the Ordering of Messages and Bugs with Old State |
| =========================== |
| This is a sordid tale involving message ordering, message delivery times, and |
| finding out (sometimes too late) that the state you expected is gone and |
| having to deal with that error. |
| |
| A few design issues: |
| - being able to send messages and have them execute in the order they are |
| sent |
| - having message handlers resolve issues with global state. Some need to know |
| the correct 'world view', and others need to know what was the state at the |
| time they were sent. |
| - realizing that syscalls, traps, faults, and any non-IRQ entry into the |
| kernel are really messages. |
| |
| Process management messages have alternated from ROUTINE to IMMEDIATE and now |
| back to ROUTINE. These messages include such family favorites as |
| '__startcore', '__preempt', etc. Meanwhile, syscalls were coming in that |
| needed to know about the core and the process's state (specifically, yield, |
| change_to, and get_vcoreid). Finally, we wanted to avoid locking, especially |
| in KMSG handlers (imagine all cores grabbing the lock to check the vcoremap |
| or something). |
| |
| Incidentally, events were being delivered concurrently to vcores, though |
| that actually didn't matter much (check out async_events.txt for more). |
| |
| 9.1: Design Guidelines |
| --------------------------- |
| Initially, we wanted to keep broadcast messaging available as an option. As |
| noted elsewhere, we can't really do this well for startcore, since most |
| hardware broadcast options need some initial per-core setup, and any sort of |
| broadcast tree we make should be able to handle a small message. Anyway, this |
| desire in the early code to keep all messages identical led to a few |
| problems. |
| |
| Another objective of the kernel messaging was to avoid having the message |
| handlers grab any locks, especially the same lock (the proc lock is used to |
| protect the vcore map, for instance). |
| |
| Later on, a few needs popped up that motivated the changes discussed below: |
| - Being able to find out which proc/vcore was on a pcore |
| - Not having syscalls/traps require crazy logic if the carpet was pulled out |
| from under them. |
| - Having proc management calls return. This one was sorted out by making all |
| kmsg handlers return. It would be a nightmare making a ksched without this. |
| |
| 9.2: Looking at Old State: a New Bug for an Old Problem |
| --------------------------- |
| We've always had issues with syscalls coming in when the fate of a core had |
| already been determined. This is referred to in a few places as |
| "predetermined fate" vs "local state". A remote lock holder (the ksched) |
| already determined a core should be unmapped and sent a message. Only later |
| does some call like proc_yield() realize its core is already *unmapped* (I |
| use that term loosely here). This code had to realize it was working on an |
| old version of |
| state and just abort. This was usually safe, though looking at the vcoremap |
| was a bad idea. Initially, we used preempt_served as the signal, which was |
| okay. Around 12b06586 yield started to use the vcoremap, which turned out to |
| be wrong. |
| |
| A similar issue happens for the vcore messages (startcore, preempt, etc). The |
| way startcore used to work was that it would only know what pcore it was on, |
| and then look into the vcoremap to figure out what vcoreid it should be |
| running. This was to keep broadcast messaging available as an option. The |
| problem with it is that the vcoremap may have changed between when the |
| messages were sent and when they were executed. Imagine a startcore followed |
| by a preempt, after which the vcore was unmapped. Well, to get around that, we |
| had the unmapping happen in the preempt or death handlers. Yikes! This was |
| the case back in the early days of ROS. This meant the vcoremap wasn't |
| actually representative of the decisions the ksched made - we also needed to |
| look at the state we'd have after all outstanding messages executed. And this |
| would differ from the vcore lists (which were correct for a lock holder). |
| |
| This was manageable for a little while, until I tried to conclusively know |
| who owned a particular pcore. This came up while making a provisioning |
| scheduler. Given a pcore, tell me which process/vcore (if any) was on it. |
| It was rather tough. Getting the proc wasn't too hard, but knowing which |
| vcore was a little tougher. (Note the ksched doesn't care about which vcore |
| is running, and the |
| process can change vcores on a pcore at will). But once you start looking at |
| the process, you can't tell which vcore a certain pcore has. The vcoremap may |
| be wrong, since a preempt is already on the way. You would have had to scan |
| the vcore lists to see if the proc code thought that vcore was online or not |
| (which would mean there had been no preempts). This is the pain I was talking |
| about back around commit 5343a74e0. |
| |
| So I changed things so that the vcoremap was always correct for lock holders, |
| and used pcpui to track owning_vcoreid (for preempt/notify), and used an extra |
| KMSG variable to tell startcore which vcoreid it should use. In doing so, we |
| (re)created the issue that the delayed unmapping dealt with: the vcoremap |
| would represent *now*, and not the vcoremap of when the messages were first |
| sent. However, this had little to do with the KMSGs, which I was originally |
| worried about. No one was looking at the vcoremap without the lock, so the |
| KMSGs were okay, but remember: syscalls are like messages too. They needed to |
| figure out what vcore they were on, i.e. what vcore userspace was making |
| requests on (viewing a trap/fault as a type of request). |
| |
| Now the problem was that we were using the vcoremap to figure out which vcore |
| we were supposed to be. When a syscall finally ran, the vcoremap could be |
| completely wrong, and with immediate KMSGs (discussed below), the pcpui was |
| already changed! We dealt with the problem for KMSGs, but not syscalls, and |
| basically reintroduced the bug of looking at current state and thinking it |
| represented the state from when the 'message' was sent (when we trapped into |
| the kernel, for a syscall/exception). |
| |
| 9.3: Message Delivery, Circular Waiting, and Having the Carpet Pulled Out |
| --------------------------- |
| In-order message delivery was what drove me to build the kernel messaging |
| system in the first place. It provides in-order messages to a particular |
| pcore. This was enough for a few scenarios, such as preempts racing ahead of |
| startcores, or deaths racing ahead of preempts, etc. However, I also wanted |
| an ordering of messages related to a particular vcore, and this wasn't |
| apparent early on. |
| |
| The issue first popped up with a startcore coming quickly on the heels of a |
| preempt for the same VC, but on different PCs. The startcore cannot proceed |
| until the preempt has saved the TF into the VCPD. The old way of dealing with |
| this was to spin in '__map_vcore()'. This was problematic, since it meant we |
| were spinning while holding a lock, and resulted in some minor bugs and issues |
| with lock ordering and IRQ disabling (couldn't disable IRQs and then try to |
| grab the lock, since the lock holder could have sent you a message and is |
| waiting for you to handle the IRQ/IMMED KMSG). However, it was doable. But |
| what wasn't doable was to have the KMSGs be ROUTINE. Any syscalls that tried |
| to grab the proc lock (lots of them) would deadlock, since the lock holder was |
| waiting on us to handle the preempt (same circular waiting issue as above). |
| |
| This was fine, albeit subpar, until a new issue showed up. Sending IMMED |
| KMSGs worked fine if we were coming from userspace already, but if we were in |
| the kernel, those messages would run immediately (hence the name), just like |
| an IRQ handler, and could confuse syscalls that touched cur_ctx/pcpui. If a |
| preempt came in during a syscall, the process/vcore could be changed before |
| the syscall took place. Some syscalls could handle this, albeit poorly. |
| sys_proc_yield() and sys_change_vcore() delicately tried to detect if they |
| were still mapped or not and use that to determine if a preemption happened. |
| |
| As mentioned above, looking at the vcoremap only tells you what is currently |
| happening, and not what happened in the past. Specifically, it doesn't tell |
| you the state of the mapping when a particular core trapped into the kernel |
| for a syscall (referred to as when the 'message' was sent up above). Imagine |
| sys_get_vcoreid(): you trap in, then immediately get preempted, then startcore |
| for the same process but a different vcoreid. The syscall would return with |
| the vcoreid of the new vcore, since it cannot tell there was a change. The |
| async syscall would complete and we'd have a wrong answer. While this never |
| happened to me, I had a similar issue while debugging some other bugs (I'd get |
| a vcoreid of 0xdeadbeef, for instance, which was the old poison value for an |
| unmapped vcoreid). There are a bunch of other scenarios that trigger similar |
| disasters, and they are very hard to avoid. |
| |
| One way out of this was a per-core history counter, that changed whenever we |
| changed cur_ctx. Then when we trapped in for a syscall, we could save the |
| value, enable_irqs(), and go about our business. Later on, we'd have to |
| disable_irqs() and compare the counters. If they were different, we'd have to |
| bail out somehow. This could have worked for change_to and yield, and some |
| others. But any syscall that wanted to operate on cur_ctx in some way would |
| fail (imagine a hypothetical sys_change_stack_pointer()). The context that |
| trapped has already returned on another core. I guess we could just fail that |
| syscall, though it seems a little silly to not be able to do that. |
| |
| The previous example was a bit contrived, but let's also remember that it isn't |
| just syscalls: all exceptions have the same issue. Faults might be fixable, |
| since if you restart a faulting context, it will start on the faulting |
| instruction. However, all traps (like syscall) restart on the next |
| instruction. Hope we don't want to do anything fancy with breakpoint! Note |
| that I had breakpointing contexts restart on other pcores and continue while I |
| was in the breakpoint handler (noticed while I was debugging some bugs with |
| lots of preempts). Yikes. And don't forget we eventually want to do some |
| complicated things with the page fault handler, and may want to turn on |
| interrupts / kthread during a page fault (imagine hitting disk). Yikes. |
| |
| So I looked into going back to ROUTINE kernel messages. With ROUTINE |
| messages, I didn't have to worry about having the carpet pulled out from under |
| syscalls and exceptions (traps, faults, etc). The 'carpet' is stuff like |
| cur_ctx, owning_proc, owning_vcoreid, etc. We still cannot trust the vcoremap, |
| unless we *know* there were no preempts or other KMSGs waiting for us. |
| (Incidentally, in the recent fix a93aa7559, we merely use the vcoremap as a |
| sanity check). |
| |
| However, we can't just switch back to ROUTINEs. Remember: with ROUTINEs, |
| we will deadlock in '__map_vcore()', when it waits for the completion of |
| the preempt. Ideally, we would have had startcore spin on the signal. Since we |
| already gave up on using x86-style broadcast IPIs for startcore (in |
| 5343a74e0), we might as well pass along a history counter, so it knows to wait |
| on preempt. |
| |
| 9.4: The Solution |
| --------------------------- |
| To fix up all of this, we now detect preemptions in syscalls/traps and order |
| our kernel messages with two simple per-vcore counters. Whenever we send a |
| preempt, we up one counter. Whenever that preempt finishes, it ups another |
| counter. When we send out startcores, we send a copy of the first counter. |
| This is a way of telling startcore where it belongs in the list of messages. |
| More specifically, it tells it which preempt happens-before it. |
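| |
| A minimal sketch of the scheme in C. The field names match the |
| nr_preempts_sent parameter mentioned below; the kmsg-sending helpers are |
| made up: |
| |
|     struct vcore { |
|         /* ... */ |
|         unsigned int nr_preempts_sent;  /* upped when a preempt is sent */ |
|         unsigned int nr_preempts_done;  /* upped when a preempt finishes */ |
|     }; |
| |
|     /* Lockholder, sending the messages (proc_lock held): */ |
|     void send_preempt(struct vcore *vc, int pcoreid) |
|     { |
|         vc->nr_preempts_sent++; |
|         send_kmsg(pcoreid, __preempt_handler);          /* assumed */ |
|     } |
| |
|     void send_startcore(struct vcore *vc, int pcoreid) |
|     { |
|         /* the snapshot tells startcore which preempt happens-before it */ |
|         send_kmsg_arg(pcoreid, __startcore_handler, vc->nr_preempts_sent); |
|     } |
| |
|     /* __preempt ups nr_preempts_done once the context is safely in the |
|      * VCPD; __startcore spins til the preempt it follows is complete: */ |
|     void __startcore_handler(struct vcore *vc, unsigned int sent_snapshot) |
|     { |
|         while (vc->nr_preempts_done != sent_snapshot) |
|             cpu_relax(); |
|         /* safe to proceed; our predecessor preempt has finished */ |
|     } |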
| |
| Basically, I wanted a partial ordering on my messages, so that messages sent |
| to a particular vcore are handled in the order they were sent, even if those |
| messages run on different physical cores. |
| |
| It is not sufficient to use a seq counter (one integer, odd values for |
| 'preempt in progress' and even values for 'preempt done'). It is possible to |
| have multiple preempts in flight for the same vcore, albeit with startcores in |
| between. Still, there's no way to encode that scenario in just one counter. |
| |
| Here's a normal example of traffic to some vcore. I note both the sending and |
| the execution of the kmsgs: |
| nr_pre_sent nr_pre_done pcore message sent/status |
| ------------------------------------------------------------- |
| 0 0 X startcore (nr_pre_sent == 0) |
| 0 0 X startcore (executes) |
| 1 0 X preempt (kmsg sent) |
| 1 1 Y preempt (executes) |
| 1 1 Y startcore (nr_pre_sent == 1) |
| 1 1 Y startcore (executes) |
| |
| Note the messages are always sent by the lockholder in the order of the |
| example above. |
| |
| Here's when the startcore gets ahead of the prior preempt: |
| nr_pre_sent nr_pre_done pcore message sent/status |
| ------------------------------------------------------------- |
| 0 0 X startcore (nr_pre_sent == 0) |
| 0 0 X startcore (executes) |
| 1 0 X preempt (kmsg sent) |
| 1 0 Y startcore (nr_pre_sent == 1) |
| 1 1 X preempt (executes) |
| 1 1 Y startcore (executes) |
| |
| Note that this can only happen across cores, since KMSGs to a particular core |
| are handled in order (for a given class of message). The startcore blocks on |
| the prior preempt. |
| |
| Finally, here's an example of what a seq ctr can't handle: |
| nr_pre_sent nr_pre_done pcore message sent/status |
| ------------------------------------------------------------- |
| 0 0 X startcore (nr_pre_sent == 0) |
| 1 0 X preempt (kmsg sent) |
| 1 0 Y startcore (nr_pre_sent == 1) |
| 2 0 Y preempt (kmsg sent) |
| 2 0 Z startcore (nr_pre_sent == 2) |
| 2 1 X preempt (executes (upped to 1)) |
| 2 1 Y startcore (executes (needed 1)) |
| 2 2 Y preempt (executes (upped to 2)) |
| 2 2 Z startcore (executes (needed 2)) |
| |
| As a nice bonus, it is easy for syscalls that care about the vcoreid (yield, |
| change_to, get_vcoreid) to check if they have a preempt_served. Just grab the |
| lock (to prevent further messages being sent), then check the counters. If |
| they are equal, there is no preempt on its way. This actually was the |
| original way we checked for preempts in proc_yield back in the day. It was |
| just called preempt_served. Now, it is split into two counters, instead of |
| just being a bool. |
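| |
| A sketch of that check (names made up; a real caller would act on the |
| answer while still holding the lock): |
| |
|     /* Returns nonzero if a preempt for this vcore is still in flight */ |
|     int preempt_is_pending(struct proc *p, struct vcore *vc) |
|     { |
|         int pending; |
| |
|         /* the lock prevents further preempt messages from being sent */ |
|         spin_lock(&p->proc_lock); |
|         pending = (vc->nr_preempts_sent != vc->nr_preempts_done); |
|         spin_unlock(&p->proc_lock); |
|         return pending; |
|     } |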
| |
| Regardless of whether or not we were preempted, we still can look at |
| pcpui->owning_proc and owning_vcoreid to figure out what the vcoreid of the |
| trap/syscall is, and we know that the cur_ctx is still the correct cur_ctx (no |
| carpet pulled out), since while there could be a preempt ROUTINE message |
| waiting for us, we simply haven't run it yet. So calls like yield should |
| still fail (since your core has been unmapped and you need to bail out and run |
| the preempt handler), but calls like sys_change_stack_pointer can proceed. |
| More importantly than that old joke syscall, the page fault handler can try to |
| do some cool things without worrying about really crazy stuff. |
| |
| 9.5: Why We (probably) Don't Deadlock |
| --------------------------- |
| It's worth thinking about why this setup of preempts and startcores can't |
| deadlock; any time we spin in the kernel, we ought to do this sort of |
| analysis. Perhaps there is some issue with other KMSGs for other processes, |
| or other vcores, or something along those lines that can cause a deadlock. |
| |
| Hypothetical case: pcore 1 has a startcore for vc1 which is stuck behind vc2's |
| startcore on PC2, with time going upwards. In these examples, startcores are |
| waiting on particular preempts, subject to the nr_preempts_sent parameter sent |
| along with the startcores. |
| |
| ^ |
| | _________ _________ |
| | | | | | |
| | | pr vc 2 | | pr vc 1 | |
| | |_________| |_________| |
| | |
| | _________ _________ |
| | | | | | |
| | | sc vc 1 | | sc vc 2 | |
| | |_________| |_________| |
| t |
| --------------------------------------------------------------------------- |
| ______ ______ |
| | | | | |
| | PC 1 | | PC 2 | |
| |______| |______| |
| |
| Here's the same picture, but with certain happens-before arrows. We'll use |
| X --> Y to mean X happened before Y (e.g., was sent before Y; a startcore |
| is sent after the preempt it follows). |
| |
| ^ |
| | _________ _________ |
| | | | | | |
| | .-> | pr vc 2 | --. .----- | pr vc 1 | <-. |
| | | |_________| \ / & |_________| | |
| | * | \/ | * |
| | | _________ /\ _________ | |
| | | | | / \ & | | | |
| | '-- | sc vc 1 | <-' '----> | sc vc 2 | --' |
| | |_________| |_________| |
| t |
| --------------------------------------------------------------------------- |
| ______ ______ |
| | | | | |
| | PC 1 | | PC 2 | |
| |______| |______| |
| |
| The arrows marked with * are ordered like that due to the property of KMSGs, |
| in that we have in order delivery. Messages are executed in the order in |
| which they were sent (serialized with a spinlock btw), so on any pcore, |
| messages that are further ahead in the queue were sent before (and thus will |
| be run before) other messages. |
| |
| The arrows marked with a & are ordered like that due to how the proc |
| management code works. The kernel won't send out a startcore for a |
| particular vcore before it sent out a preempt. (Note that technically, |
| preempts follow startcores; the startcores in this example are when we |
| start up a vcore after it had been preempted in the past.) |
| |
| Anyway, note that we have a cycle, where all events happened before each |
| other, which isn't possible. The trick to connecting "unrelated" events like |
| this (unrelated meaning 'not about the same vcore') in a happens-before manner |
| is the in-order properties of the KMSGs. |
| |
| Based on this example, we can derive general rules. Note that 'sc vc 2' could |
| be any kmsg that waits on another message placed behind 'sc vc 1'. This would |
| require us having sent a KMSG that waits on a KMSG that we send later. Bad |
| idea! (You could have sent that KMSG to yourself, aside from it just being |
| dangerous.) If you want to spin, make sure you send the work that should |
| happen-before actually before the waiter. |
| |
| In fact, we don't even need 'sc vc 2' to be a KMSG. It could be miscellaneous |
| kernel code, like a proc mgmt syscall. Imagine if we did something like the |
| old '__map_vcore' call from within the ksched. That would be code that holds |
| the lock, and then waits on the execution of a message handler. That would |
| deadlock (which is why we don't do it anymore). |
| |
| Finally, in case this isn't clear, all of the startcores and preempts for |
| a given vcore exist in a happens-before relation, both in sending and in |
| execution. The sending aspect is handled by proc mgmt code. For execution, |
| preempts always follow startcores due to the KMSG ordering property. For |
| execution of startcores, startcores always spin until the preempt they follow |
| is complete, ensuring the execution of the main part of their handler happens |
| after the prior preempt. |
| |
| Here's some good ideas for the ordering of locks/irqs/messages: |
| - You can't hold a spinlock of any sort and then wait on a routine kernel |
| message. The core where that runs may be waiting on you, or some scenario |
| like above. |
| - Similarly, think about how this works with kthreads. A kthread |
| restart is a routine KMSG. You shouldn't be waiting on code that |
| could end up kthreading, mostly because those calls block! |
| - You can hold a spinlock and wait on an IMMED kmsg, if the waiters of the |
| spinlock have irqs enabled while spinning (this is what we used to do with |
| the proc lock and IMMED kmsgs, and 54c6008 is an example of doing it wrong) |
| - As a corollary, locks like this cannot be irqsave, since the other |
| attempted locker will have irqs disabled |
| - For broadcast trees, you'd have to send IMMEDs for the intermediates, and |
| then it'd be okay to wait on those intermediate, immediate messages (if we |
| wanted confirmation of the posting of RKM) |
| - The main thing any broadcast mechanism needs to do is make sure all |
| messages get delivered in order to particular pcores (the central |
| premise of KMSGs) (and not deadlock due to waiting on a KMSG |
| improperly) |
| - Alternatively, we could use routines for the intermediates if we didn't |
| want to wait for RKMs to hit their destination; we'd need to always use the |
| same proxy for the same destination pcore, e.g., core 16 always covers |
| 16-31. |
| - Otherwise, we couldn't guarantee the ordering of SC before PR before |
| another SC (which the proc_lock and proc mgmt code does); we need the |
| ordering of intermediate msgs on the message queues of a particular |
| core. |
| - All kmsgs would need to use this broadcasting style (couldn't mix |
| regular direct messages with broadcast), so odds are this style would |
| be of limited use. |
| - since we're not waiting on execution of a message, we could use RKMs |
| (while holding a spinlock) |
| - There might be some bad effects with kthreads delaying the reception of |
| RKMs for a while, but probably not catastrophically. |
| |
| 9.6: Things That We Don't Handle Nicely |
| --------------------------- |
| If for some reason a syscall or fault handler blocks *unexpectedly*, we could |
| have issues. Imagine if change_to happens to block in some early syscall code |
| (like instrumentation, or who knows what, that blocks in memory allocation). |
| When the syscall kthread restarts, its old cur_ctx is gone. It may or may not |
| be running on a core owned by the original process. If it was, we probably |
| would accidentally yield that vcore (clearly a bug). |
| |
| For now, any of these calls that care about cur_ctx/pcpui need to not block |
| without some sort of protection. None of them do, but in the future we might |
| do something that causes them to block. We could deal with it by having a |
| pcpu or per-kthread/syscall flag that says if it ever blocked, and possibly |
| abort. We get into similar nasty areas as with preempts, but this time, we |
| can't solve it by making preempt a routine KMSG - we block as part of that |
| syscall/handler code. Odds are, we'll just have to outlaw this, now and |
| forever. Just note that if a syscall/handler blocks, the TF it came in on is |
| probably not cur_ctx any longer, and that old cur_ctx has probably restarted. |
| |
| 10. TBD |
| =========================== |