processes.txt
Barret Rhoden

All things processes!  This explains processes from a high level, especially
focusing on the user-kernel boundary and transitions to the many-core state,
which is the way in which parallel processes run.  This doesn't discuss deep
details of the ROS kernel's process code.

This is motivated by two things: kernel scalability and direct support for
parallel applications.

Part 1: World View of Processes
Part 2: How They Work
Part 3: Resource Requests
Part 4: Preemption and Event Notification
Part 5: Old Arguments (mostly for archival purposes)
Part 6: Parlab app use cases

Revision History:
2009-10-30 - Initial version
2010-03-04 - Preemption/Notification, changed to many-core processes
Part 1: World View of Processes
==================================
A process is the lowest level of control, protection, and organization in the
kernel.

1.1: What's a process?
-------------------------------
Features:
- They are an executing instance of a program.  A program can load multiple
other chunks of code and run them (libraries), but those chunks are written
to work with each other, within the same address space, and are in essence
one entity.
- They have one address space / protection domain.
- They run in Ring 3 / Usermode.
- They can interact with each other, subject to permissions enforced by the
kernel.
- They can make requests of the kernel, for things like resource guarantees.
They have a list of resources that are given/leased to them.

None of these are new.  Here's what's new:
- They can run in many-core mode, where a process's cores run at the same
time, and the process is aware of changes to these conditions (page faults,
preemptions).  It can still request more resources (cores, memory, whatever).
- Every core in a many-core process (MCP) is *not* backed by a kernel
thread/kernel stack, unlike with Linux tasks.
- There are *no* per-core run-queues in the kernel that decide for
themselves which kernel thread to run.
- They are not fork()/exec()ed.  They are created, and then later made
runnable.  This allows the controlling process (parent) to do whatever it
wants: pass file descriptors, give resources, whatever.

These changes are directly motivated by what is wrong with current SMP
operating systems as we move towards many-core: direct (first class) support
for truly parallel processes, kernel scalability, and an ability of a process
to see through classic abstractions (the virtual processor) to understand (and
make requests about) the underlying state of the machine.

1.2: What's a partition?
-------------------------------
So a process can make resource requests, but some part of the system needs to
decide what to grant, when to grant it, etc.  This goes by several names:
scheduler / resource allocator / resource manager.  The scheduler simply
decides when you get some resources, then calls functions from lower parts of
the kernel to make it happen.

This is where the partitioning of resources comes in.  In the simple case (one
process per partitioned block of resources), the scheduler just finds a slot
and runs the process, giving it its resources.

A big distinction is that the *partitioning* of resources only makes sense
from the scheduler on up in the stack (towards userspace).  The lower levels
of the kernel know about resources that are granted to a process.  The
partitioning is about the accounting of resources and an interface for
adjusting their allocation.  It is a method for telling the 'scheduler' how
you want resources to be granted to processes.

A possible interface for this is procfs, which has a nice hierarchy.
Processes can be grouped together, and resources can be granted to them.  Who
does this?  A process can create its own directory entry (a partition), and
move anyone it controls (parent of, though that's not necessary) into its
partition or a sub-partition.  Likewise, a sysadmin/user can simply move PIDs
around in the tree, creating partitions consisting of processes completely
unaware of each other.

Now you can say things like "give 25% of the system's resources to apache and
mysql".  They don't need to know about each other.  If you want finer-grained
control, you can create subdirectories (subpartitions), and give resources on
a per-process basis.  This is back to the simple case of one process for one
(sub)partition.

This is all influenced by Linux's cgroups (process control groups):
http://www.mjmwired.net/kernel/Documentation/cgroups.txt.  They group
processes together, and allow subsystems to attach meaning to those groups.

Ultimately, I view partitioning as something that tells the kernel how to
grant resources.  It's an abstraction presented to userspace and higher levels
of the kernel.  The specifics still need to be worked out, but by separating
them from the process abstraction, we can work them out and try a variety of
approaches.

The actual granting of resources and enforcement is done by the lower levels
of the kernel (or by hardware, depending on future architectural changes).
Part 2: How They Work
===============================
2.1: States
-------------------------------
PROC_CREATED
PROC_RUNNABLE_S
PROC_RUNNING_S
PROC_WAITING
PROC_DYING
PROC_DYING_ABORT
PROC_RUNNABLE_M
PROC_RUNNING_M

Differences between the _M and the _S states:
- _S : legacy process mode.  There is no need for a second-level scheduler, and
the code running is analogous to a user-level thread.
- RUNNING_M implies *guaranteed* core(s).  You can be a single core in the
RUNNING_M state.  The guarantee is subject to time slicing, but when you
run, you get all of your cores.
- The time slicing is at a coarser granularity for _M states.  This means that
when you run an _S on a core, it should be interrupted/time sliced more
often, which also means the core should be classified differently for a
while.  Possibly even using its local APIC timer.
- A process in an _M state will be informed about changes to its state, e.g.,
it will have a handler run in the event of a page fault.

For more details, check out kern/inc/process.h.  For valid transitions between
these, check out kern/src/process.c's proc_set_state().
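
As a rough sketch of how these might be declared (illustrative only -- the
authoritative definitions are in kern/inc/process.h, and the actual values may
differ), bit flags make it cheap for proc_set_state() to check a transition
against a whole set of allowed source states at once:

    /* Illustrative sketch; see kern/inc/process.h for the real thing. */
    #define PROC_CREATED      0x01
    #define PROC_RUNNABLE_S   0x02
    #define PROC_RUNNING_S    0x04
    #define PROC_WAITING      0x08
    #define PROC_DYING        0x10
    #define PROC_DYING_ABORT  0x20
    #define PROC_RUNNABLE_M   0x40
    #define PROC_RUNNING_M    0x80

    /* With bit flags, a transition check is a single mask test, e.g.
     * "you may only enter RUNNABLE_M from one of these states": */
    if (!(p->state & (PROC_RUNNING_S | PROC_RUNNING_M)))
            panic("Invalid transition to RUNNABLE_M");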

2.2: Creation and Running
-------------------------------
Unlike the fork-exec model, processes are created, and then explicitly made
runnable.  In the time between creation and running, the parent (or another
controlling process) can do whatever it wants with the child, such as passing
specific file descriptors or mapping shared memory regions (which can be used
to pass arguments).

New processes are not a copy-on-write version of the parent's address space.
Due to our changes in the threading model, we no longer need (or want) this
behavior left over from the fork-exec model.

By splitting the creation from the running and by explicitly sharing state
between processes (like inherited file descriptors), we avoid a lot of
concurrency and security issues.
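
As a hedged sketch of what this looks like from the parent's side (the syscall
names and signatures below are assumptions for illustration, not the real ROS
interface), the point is the ordering: create, configure, then explicitly make
runnable:

    /* Sketch only: sys_proc_create(), sys_pass_fd(), and sys_proc_run()
     * are hypothetical names standing in for the real syscalls. */
    extern int sys_proc_create(const char *path);
    extern int sys_pass_fd(int child_pid, int fd);
    extern int sys_proc_run(int child_pid);

    static int spawn_child(const char *path, int fd_to_pass)
    {
            int child = sys_proc_create(path); /* child is PROC_CREATED */
            if (child < 0)
                    return -1;
            sys_pass_fd(child, fd_to_pass);    /* configure before it runs */
            return sys_proc_run(child);        /* child -> PROC_RUNNABLE_S */
    }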

2.3: Vcoreid vs Pcoreid
-------------------------------
The vcoreid is a virtual cpu number.  Its purpose is to provide an easy way
for the kernel and userspace to talk about the same core.  pcoreid (physical)
would also work.  The vcoreid makes things a little easier, such as when a
process wants to refer to one of its other cores (not the calling core).  It
also makes the event notification mechanisms easier to specify and maintain.

Processes that care about locality should check what their pcoreid is.  This
is currently done via sys_getcpuid().  The name will probably change.

2.4: Transitioning to and from states
-------------------------------
2.4.1: To go from _S to _M, a process requests cores.
--------------
A resource request from 0 to 1 or more causes a transition from _S to _M.  The
calling context is saved in the uthread slot (uthread_ctx) in vcore0's
preemption data (in procdata).  The second-level scheduler needs to be able to
restart the context when vcore0 starts up.  To do this, it will need to save the
TLS/TCB descriptor and the floating point/silly state (if applicable) in the
user-thread control block, and do whatever is needed to signal vcore0 to run the
_S context when it starts up.  One way would be to mark vcore0's "active thread"
variable to point to the _S thread.  When vcore0 starts up at
_start/vcore_entry() (like all vcores), it will see a thread was running there
and restart it.  The kernel will migrate the _S thread's silly state (FP) to the
new pcore, so that it looks like the process was simply running the _S thread
and got notified.  Odds are, it will want to just restart that thread, but the
kernel won't assume that (hence the notification).
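
A minimal sketch of that userspace dispatch, assuming a per-vcore "active
thread" pointer (all names here are illustrative; the real logic belongs to
the second-level scheduler):

    /* Sketch of vcore_entry(); names are illustrative. */
    struct uthread *active_thread[MAX_VCORES];   /* set by the 2LS */

    void vcore_entry(void)
    {
            uint32_t vcoreid = get_vcoreid();
            struct uthread *ut = active_thread[vcoreid];

            if (ut) {
                    /* e.g. vcore0 finding the saved _S context */
                    restore_tls(ut);
                    run_uthread(ut);  /* pops uthread_ctx, never returns */
            }
            schedule_new_thread();    /* otherwise the 2LS picks work */
    }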

In general, all cores (and all subsequently allocated cores) start at the elf
entry point, with vcoreid in eax or a suitable arch-specific manner.  There is
also a syscall to get the vcoreid, but this will save an extra trap at vcore
start time.

Future proc_runs(), like from RUNNABLE_M to RUNNING_M, start all cores at the
entry point, including vcore0.  The saving of a _S context to vcore0's
uthread_ctx only happens on the transition from _S to _M (which the process
needs to be aware of for a variety of reasons).  This also means that userspace
needs to handle vcore0 coming up at the entry point again (and not starting the
program over).  This is currently done in sysdeps-ros/start.c, via the static
variable init.  Note there are some tricky things involving dynamically linked
programs, but it all works currently.

When coming in to the entry point, whether as the result of a startcore or a
notification, the kernel will set the stack pointer to whatever is requested
by userspace in procdata.  A process should allocate stacks of whatever size
it wants for its vcores when it is in _S mode, and write these locations to
procdata.  These stacks are the transition stacks (in Lithe terms) that are
used as jumping-off points for future function calls.  These stacks need to be
used in a continuation-passing style, and each time they are used, they start
from the top.
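
For example, before requesting cores, an _S process might set these up like so
(the procdata field names here are assumptions for illustration):

    /* Sketch: allocate transition stacks and advertise them in procdata
     * before asking for cores.  Field names are illustrative. */
    #define TRANS_STACK_SZ (4 * PGSIZE)

    void setup_transition_stacks(int max_vcores)
    {
            for (int i = 0; i < max_vcores; i++) {
                    void *stk = malloc(TRANS_STACK_SZ);
                    /* The kernel sets the SP to this on a startcore or
                     * notification; every entry restarts from the top. */
                    procdata.vcore_preempt_data[i].transition_stack =
                            (uintptr_t)stk + TRANS_STACK_SZ;
            }
    }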

2.4.2: To go from _M to _S, a process requests 0 cores
--------------
The caller becomes the new _S context.  Everyone else gets trashed
(abandon_core()).  Their stacks are still allocated and it is up to userspace
to deal with this.  In general, they will regrab their transition stacks when
they come back up.  Their other stacks and whatnot (like TBB threads) need to
be dealt with.

When the caller next switches to _M, that context (including its stack)
maintains its old vcore identity.  If vcore3 causes the switch to _S mode, it
ought to remain vcore3 (lots of things get broken otherwise).
As of March 2010, the code does not reflect this.  Don't rely on anything in
this section for the time being.

2.4.3: Requesting more cores while in _M
--------------
Any core can request more cores and adjust the resource allocation in any way.
These new cores come up just like the original new cores in the transition
from _S to _M: at the entry point.

2.4.4: Yielding
--------------
sys_yield()/proc_yield() will give up the calling core, and may or may not
adjust the desired number of cores, subject to its parameters.  Yield performs
two tasks, both of which result in giving up the core.  One is for not wanting
the core anymore.  The other is in response to a preemption.  Yield may not be
called remotely (ARSC).

In _S mode, it will transition from RUNNING_S to RUNNABLE_S.  The context is
saved in scp_ctx.

In _M mode, this yields the calling core.  A yield will *not* transition from _M
to _S.  The kernel will rip it out of your vcore list.  A process can yield its
cores in any order.  The kernel will "fill in the holes of the vcoremap" for any
future new cores requested (e.g., proc A has 4 vcores, yields vcore2, and then
asks for another vcore.  The new one will be vcore2).  When any core starts in
_M mode, even after a yield, it will come back at the vcore_entry()/_start point.

Yield will normally adjust your desired number of vcores to the amount after the
calling core is taken.  This is the way a process gives its cores back.

Yield can also be used to say the process is just giving up the core in response
to a pending preemption, but actually wants the core and does not want resource
requests to be readjusted.  For example, in the event of a preemption
notification, a process may yield (ought to!) so that the kernel does not need
to waste effort with full preemption.  This is done by passing in a bool
(being_nice), which signals the kernel that the yield is in response to a
preemption.  The kernel will not readjust the amt_wanted, and if there is no
preemption pending, the kernel will ignore the yield.
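
In kernel terms, a hedged sketch of that check (simplified, with illustrative
names; the real logic lives in proc_yield()):

    /* Sketch of proc_yield()'s being_nice handling; illustrative. */
    void proc_yield(struct proc *p, bool being_nice)
    {
            if (being_nice) {
                    /* Only yielding because of a pending preemption.  If
                     * none is pending, the process still wants this core,
                     * so ignore the yield entirely. */
                    if (!preempt_is_pending(p, current_vcoreid()))
                            return;
                    /* Leave amt_wanted alone: the process wants the core
                     * back as soon as possible. */
            } else {
                    /* A "real" yield: the process is giving the core
                     * back, so lower its desired core count. */
                    p->resources[RES_CORES].amt_wanted--;
            }
            __proc_give_up_core(p, current_vcoreid());
    }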

There may be an m_yield(), which will yield all or some of the cores of an MCP,
remotely.  This is discussed a bit farther down.  It's not clear what exactly
its purpose would be.

We also haven't addressed other reasons to yield, or more specifically to wait,
such as for an interrupt or an event of some sort.

2.4.5: Others
--------------
There are other transitions, mostly self-explanatory.  We don't currently use
any WAITING states, since we have nothing to block on yet.  DYING is a state
when the kernel is trying to kill your process, which can take a little while
to clean up.

Part 3: Resource Requests
===============================
A process can ask for resources from the kernel.  The kernel either grants
these requests or not, subject to QoS guarantees or other scheduler-related
criteria.

A process requests resources, currently via sys_resource_req.  The form of a
request is to tell the kernel how much of a resource it wants.  Currently,
this is the amt_wanted.  We'll also have a minimum amount wanted, which tells
the scheduler not to run the process until the minimum amount of resources is
available.
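
As a sketch (the exact signature of sys_resource_req may differ from this; the
RES_CORES constant and the flags argument are assumptions), a request for
cores carrying both amounts might look like:

    /* Sketch: want 8 cores, but don't run us with fewer than 4.
     * Signature and names are illustrative. */
    sys_resource_req(RES_CORES, 8 /* amt_wanted */,
                     4 /* minimum wanted */, 0 /* flags */);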

How the kernel actually grants resources is resource-specific.  In general,
there are functions like proc_give_cores() (which gives certain cores to a
process) that actually do the allocation, as well as adjust the amt_granted
for that resource.

For expressing QoS guarantees, we'll probably use something like procfs (as
mentioned above) to explicitly tell the scheduler/resource manager what the
user/sysadmin wants.  An interface like this ought to be usable both by
programs as well as simple filesystem tools (cat, etc).

Guarantees exist regardless of whether or not the allocation has happened.  An
example of this is when a process may be guaranteed to use 8 cores, but
currently only needs 2.  Whenever it asks for up to 8 cores, it will get them.
The exact nature of the guarantee is TBD, but there will be some sort of
latency involved in the guarantee for systems that want to take advantage of
idle resources (compared to simply reserving and not allowing anyone else to
use them).  A latency of 0 would mean a process wants it instantly, which
probably means the resources ought to be already allocated (and billed) to
that process.

Part 4: Preemption and Event Notification
===============================
Preemption and Notification are tied together.  Preemption is when the kernel
takes a resource (specifically, cores).  There are two types: core_preempt()
(one core) and gang_preempt() (all cores).  Notification (discussed below) is
when the kernel informs a process of an event, usually referring to the act of
running a function on a core (active notification).

The rough plan for preemption is to notify beforehand, then take action if
userspace doesn't yield.  This is a notification a process can ignore, though
it is highly recommended to at least be aware of impending core_preempt()
events.

4.1: Notification Basics
-------------------------------
One of the philosophical goals of ROS is to expose information up to userspace
(and allow requests based on that information).  There will be a variety of
events in the system that processes will want to know about.  To handle this,
we'll eventually build something like the following.

All events will have a number, like an interrupt vector.  Each process will
have an event queue (per core, described below).  On most architectures, it
will be a simple producer-consumer ring buffer sitting in the "shared memory"
procdata region (shared between the kernel and userspace).  The kernel writes
a message into the buffer with the event number and some other helpful
information.

Additionally, the process may request to be actively notified of specific
events.  This is done by having the process write into an event vector table
(like an IDT) in procdata.  For each event, the process writes the vcoreid it
wants to be notified on.
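
Putting those two pieces together, the procdata state could look roughly like
this (a sketch; all sizes and field names here are made up for illustration):

    /* Sketch of per-core event delivery state; names are illustrative. */
    struct event_msg {
            uint16_t ev_type;        /* event number, like an IRQ vector */
            uint32_t ev_arg;         /* event-specific payload */
            uint64_t timestamp;      /* some events expire in meaning */
    };

    struct event_queue {             /* kernel produces, user consumes */
            struct event_msg buf[NR_EVENT_SLOTS];
            uint32_t prod_idx, cons_idx;
            uint32_t overflow_count; /* bumped when the ring is full */
    };

    struct procdata_events {
            struct event_queue ev_queues[MAX_VCORES]; /* per-core rings */
            /* The "IDT": for each event type, which vcore wants an
             * active notification (IPI). */
            uint32_t ev_handler_vcore[NR_EVENT_TYPES];
    };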

4.2: Notification Specifics
-------------------------------
In procdata there is an array of per-vcore data, holding some
preempt/notification information and space for two trapframes: one for
notification and one for preemption.

4.2.1: Overall
-----------------------------
When a notification arrives to a process under normal circumstances, the
kernel places the previously running context in the notification trapframe, and
returns to userspace at the program entry point (the elf entry point) on the
transition stack.  If a process is already handling a notification on that
core, the kernel will not interrupt it.  It is the process's responsibility
to check for more notifications before returning to its normal work.  The
process must also unmask notifications (in procdata) before it returns to do
normal work.  Masking notifications is the signal to the kernel to not bother
sending IPIs, and if an IPI was sent just before notifications were masked,
the kernel will double-check this flag when it arrives to make sure the
notification should actually be delivered.

Notification unmasking is done by clearing the notif_disabled flag (similar to
turning interrupts on in hardware).  When a core starts up, this flag is on,
meaning that notifications are disabled by default.  It is the process's
responsibility to turn on notifications for a given vcore.

4.2.2: Notif Event Details
-----------------------------
When the process runs the handler, it is actually starting up at the same
location in code as it always does.  To determine if it was a notification or
not, simply check the queue and bitmask.  This has the added benefit of allowing
a process to notice notifications that it missed previously, or notifs it wanted
without active notification (IPI).  If we want to bypass this check by having a
magic register signal, we can add that later.  Additionally, the kernel will
mask notifications (much like an x86 interrupt gate).  It will also mask
notifications when starting a core with a fresh trapframe, since the process
will be executing on its transition stack.  The process must check its per-core
event queue to see why it was called, and deal with all of the events on the
queue.  In the case where the event queue overflows, the kernel will increment a
counter so the process can at least be aware that things were missed.  At the
very least, the process will see the notification marked in a bitmask.

These notification events include things such as: an IO is complete, a
preemption is pending to this core, the process just returned from a
preemption, there was a trap (divide by 0, page fault), and many other things.
We plan to allow this list to grow at runtime (a process can request new event
notification types).  These messages will often need some form of a timestamp,
especially ones that will expire in meaning (such as a preempt_pending).

Note that only one notification can be active at a time, including a fault.
This means that if a process page faults or something while notifications are
masked, the process will simply be killed.  It is up to the process to make
sure the appropriate pages are pinned, which it should do before entering _M
mode.

4.2.3: Event Overflow and Non-Messages
-----------------------------
For missed/overflowed events, and for events that do not need messages (they
have no parameters and multiple notifications are irrelevant), the kernel will
toggle that event's bit in a bitmask.  For the events that don't want messages,
we may have a flag that userspace sets, meaning they just want to know it
happened.  This might be too much of a pain, so we'll see.  For notification
events that overflowed the queue, the parameters will be lost, but hopefully the
application can sort it out.  Again, we'll see.  A specific notif_event should
not appear in both the event buffers and in the bitmask.

For some events, it does not make sense to have messages.  For others, it does
not make sense to specify a different core on which to run the handler (e.g.,
page faults).  The notification methods that the process expresses via procdata
are suggestions to the kernel.  When they don't make sense, they will be
ignored.  Some notifications might be unserviceable without messages.  A
process needs to have a fallback mechanism.  For example, it can read the
vcoremap to see who was lost, or it can restart a thread to cause it to page
fault again.

Event overflow sucks - it leads to a bunch of complications.  Ultimately, what
we really want is a limitless amount of notification messages (per core), as
well as a limitless amount of notification types.  And we want these to be
relayed to userspace without trapping into the kernel.

We could do this if we had a way to dynamically manage memory in procdata, with
a distrusted process on one side of the relationship.  We could imagine growing
procdata dynamically (we plan to, mostly to grow the preempt_data struct as we
request more vcores), and then run some sort of heap manager / malloc.  Things
get very tricky since the kernel should never follow pointers that userspace can
touch.  Additionally, whatever memory management we use becomes a part of the
kernel interface.

Even if we had that, dynamic notification *types* are tricky - they are
identified by a number, not by a specific (list) element.

For now, this all seems like an unnecessary pain in the ass.  We might adjust it
in the future if we come up with clean, clever ways to deal with the problem,
which we aren't even sure is a problem yet.

4.2.4: How to Use and Leave a Transition Stack
-----------------------------
We considered having the kernel be aware of a process's transition stacks and
sizes so that it can detect if a vcore is in a notification handler based on
the stack pointer in the trapframe when a trap or interrupt fires.  While
cool, the notif_disabled flag is much easier and just as capable.
Userspace needs to be aware of various races, and only enable notifications
when it is ready to have its transition stack clobbered.  This means that when
switching from one big user-thread to another, the process should temporarily
disable notifications and reenable them before starting the new thread fully.
This is analogous to having a kernel that disables interrupts while in process
context.

A process can fake not being on its transition stack, and can even unmap its
stack.  At worst, a vcore could recursively page fault (the kernel does not
know it is in a handler, if it keeps enabling notifs before faulting), and
that would continue until the core is forcibly preempted.  This is not an issue
for the kernel.

When a process wants to use its transition stack, it ought to check
preempt_pending, mask notifications, jump to its transition stack, do its work
(e.g. process notifications, check for new notifications, schedule a new
thread) periodically checking for a pending preemption, and make sure the
notification queue/list is empty before moving back to real code.  Then it
should jump back to a real stack, unmask notifications, and jump to the newly
scheduled thread.

This can be really tricky.  When userspace is changing threads, it will need to
unmask notifs as well as jump to the new thread.  There is a slight race here,
but it is okay.  The race is that an IPI can arrive after notifs are unmasked,
but before returning to the real user thread.  Then the code will think the
uthread_ctx represents the new user thread, even though it hasn't started (and
the PC is wrong).  The trick is to make sure that all state required to start
the new thread, as well as future instructions, is saved within the "stuff"
that gets saved in the uthread_ctx.  When these threading packages change
contexts, they ought to push the PC on the stack of the new thread, (then enable
notifs) and then execute a return.  If an IPI arrives before the "function
return", then when that context gets restarted, it will run the "return" with
the appropriate value on the stack still.
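
Concretely, a hedged sketch of that switch (the helper names are hypothetical
and a real version needs a little assembly; the key property is that once
notifs are enabled, everything needed to finish starting the new thread is
reachable from its stack, so a restarted uthread_ctx just re-runs the final
return):

    /* Sketch; push_on_stack(), enable_notifs(), and
     * switch_stack_and_ret() are hypothetical helpers. */
    void start_uthread(struct uthread *next)
    {
            push_on_stack(next, next->resume_pc); /* PC onto new stack */
            enable_notifs();                      /* race window opens */
            /* Pops the PC we pushed and jumps to it.  If an IPI landed
             * after enable_notifs(), the saved uthread_ctx holds our SP
             * (with the PC still on the stack), so restarting that
             * context simply re-executes this return. */
            switch_stack_and_ret(next->sp);
    }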

There is a further complication.  The kernel can send an IPI that the process
wanted, but the vcore did not get truly interrupted since its notifs were
disabled.  There is a race between checking the queue/bitmask and then enabling
notifications.  The way we deal with it is that the kernel posts the
message/bit, then sets notif_pending.  Then it sends the IPI, which may or may
not be received (based on notif_disabled).  (Actually, the kernel only ought to
send the IPI if notif_pending was 0 (atomically) and notif_disabled is 0).  When
leaving the transition stack, userspace should clear notif_pending, then
check the queue, do whatever, and then try to pop the tf.  When popping the tf,
after enabling notifications, check notif_pending.  If it is still clear, return
without fear of missing a notif.  If it is not clear, it needs to manually
notify itself (sys_self_notify) so that it can process the notification that it
missed and for which it wanted to receive an IPI.  Before it does this, it needs
to clear notif_pending, so that the kernel will actually send it the IPI.  These
last parts are handled in pop_user_ctx().
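
A sketch of both sides of that protocol (illustrative names; the userspace
half corresponds to pop_user_ctx()):

    /* Kernel side, posting an event; a sketch. */
    post_message(&vcpd->ev_q, msg);     /* or set the bitmask bit */
    wmb();                              /* message visible first */
    if (!atomic_swap(&vcpd->notif_pending, TRUE) &&
        !vcpd->notif_disabled)
            send_ipi(pcoreid);          /* only if pending was 0 */

    /* Userspace side, leaving the transition stack; a sketch. */
    vcpd->notif_pending = FALSE;        /* clear before draining */
    handle_all_events();                /* drain queue and bitmask */
    vcpd->notif_disabled = FALSE;       /* enable notifications */
    if (vcpd->notif_pending) {          /* an event raced in */
            vcpd->notif_pending = FALSE;/* so the kernel will IPI */
            sys_self_notify(vcoreid);   /* replay the missed IPI */
    }
    pop_saved_context();                /* back to the user thread */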

4.3: Preemption Specifics
-------------------------------
There's an issue with a preempted vcore getting restarted while a remote core
tries to restart that context.  They resolve this fight with a variety of VC
flags (VC_UTHREAD_STEALING).  Check out handle_preempt() in uthread.c.

4.4: Other trickiness
-------------------------------
Take all of these with a grain of salt - it's quite old.

4.4.1: Preemption -> deadlock
-------------------------------
One issue is that a context can be holding a lock that is necessary for the
userspace scheduler to manage preempted threads, and this context can be
preempted.  This would deadlock the scheduler.  To keep a process from
locking itself up, the kernel will toggle a preempt_pending flag in
procdata for that vcore before sending the actual preemption.  Whenever the
scheduler is grabbing one of these critical spinlocks, it needs to check that
flag first, and yield if a preemption is coming in.
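
For instance, a second-level scheduler's lock wrapper might look like this (a
sketch with illustrative names):

    /* Sketch: check for a pending preemption before taking a lock the
     * userspace scheduler cannot afford to hold across a preemption. */
    void crit_spin_lock(spinlock_t *lock)
    {
            while (1) {
                    if (vcpd->preempt_pending)
                            /* Give the core up now, rather than get
                             * preempted while holding the lock and
                             * deadlocking the scheduler. */
                            sys_yield(TRUE /* being_nice */);
                    if (spin_trylock(lock))
                            return;
                    cpu_relax();
            }
    }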

Another option we may implement is for the process to be able to signal to the
kernel that it is in one of these ultra-critical sections by writing a magic
value to a specific register in the trapframe.  If the kernel sees this, it
will allow the process to run for a little longer.  The issue with this is
that the kernel would need to assume processes will always do this (malicious
ones will) and add this extra wait time to the worst case preemption time.

Finally, a scheduler could try to use non-blocking synchronization (no
spinlocks), or one of our other long-term research synchronization methods to
avoid deadlock, though we realize this is a pain for userspace for now.  FWIW,
there are some OSs out there with only non-blocking synchronization (I think).

4.4.2: Cascading and overflow
-------------------------------
There used to be issues with cascading interrupts (when contexts are still
running handlers).  Imagine a page fault, followed by preempting the handler.
It doesn't make sense to run the preempt context after the page fault.
Earlier designs had issues where it was hard for a vcore to determine the
order of events and to unmix preemption, notification, and faults.  We deal
with this by having separate slots for preemption and notification, and by
treating faults as another form of notification.  Faulting while handling a
notification just leads to death.  Perhaps there is a better way to do that.

Another thing we considered would be to have two stacks - transition for
notification and an exception stack for faults.  We'd also need a fault slot
for the faulting trapframe.  This begins to take up even more memory, and it
is not clear how to handle mixed faults and notifications.  If you fault while
on the notification slot, then fine.  But you could fault for other reasons,
and then receive a notification.  And then if you fault in that handler, we're
back to where we started - might as well just kill them.

Another issue was overload.  Consider if vcore0 is set up to receive all
events.  If events come in faster than it can process them, it will both nest
too deep and process out of order.  To handle this, we only notify once, and
will not send future active notifications / interrupts until the process
issues an "end of interrupt" (EOI) for that vcore.  This is modelled after
hardware interrupts (on x86, at least).

4.4.3: Restarting a Preempted Notification
-------------------------------
Nowadays, to restart a preempted notification, you just restart the vcore.
The kernel does this either when it gives the process more cores or when
userspace asks it to with sys_change_vcore().

4.4.4: Userspace Yield Races
-------------------------------
Imagine a vcore realizes it is getting preempted soon, so it starts to yield.
However, it is too slow and doesn't make it into the kernel before a preempt
message takes over.  When that vcore is run again, it will continue where it
left off and yield its core.  The desired outcome is for yield to fail, since
the process doesn't really want to yield that core.  To sort this out, yield
will take a parameter saying that the yield is in response to a pending
preemption.  If the phase is over (preempted and returned), the call will not
yield and simply return to userspace.

4.4.5: Userspace m_yield
-------------------------------
There are a variety of ways to implement an m_yield (yield the entire MCP).
We could have a "no niceness" yield - just immediately preempt, but there is a
danger of the locking problem described in 4.4.1.  We could do the usual delay
game, though if userspace is requesting its yield, arguably we don't need to
give warning.

Another approach would be to not have an explicit m_yield call.  Instead, we
can provide a notify_all call, where the notification sent to every vcore is
to yield.  I imagine we'll have a notify_all (or rather, flags to the notify
call) anyway, so we can do this for now.

The fastest way will probably be the no niceness way.  One way to make this
work would be for vcore0 to hold all of the low-level locks (from 4.4.1) and
manually unlock them when it wakes up.  Yikes!

4.5: Random Other Stuff
-------------------------------
Pre-Notification issues: how much time does userspace need to clean up and
yield?  How quickly does the kernel need the core back (for scheduling
reasons)?

Part 5: Old Arguments about Processes vs Partitions
===============================
This is based on my interpretation of the cell (formerly what I thought was
called a partition).

5.1: Program vs OS
-------------------------------
A big difference is what runs inside the object.  I think trying to support
OS-like functionality is a quick path to unnecessary layers and complexity,
especially for the common case.  This leads to discussions of physical memory
management, spawning new programs, virtualizing HW, shadow page tables,
exporting protection rings, etc.

This unnecessarily brings in the baggage and complexity of supporting VMs,
which are a special case.  Yes, we want processes to be able to use their
resources, but I'd rather approach this from the perspective of "what do they
need?" than "how can we make it look like a real machine."  Virtual machines
are cool, and paravirtualization influenced a lot of my ideas, but they have
their place and I don't think this is it.

For example, exporting direct control of physical pages is a bad idea.  I
wasn't clear if anyone was advocating this or not.  By exposing actual machine
physical frames, we lose our ability to do all sorts of things (like swapping,
for all practical uses, and other VM tricks).  If the cell/process thinks it
is manipulating physical pages, but really isn't, we're in the VM situation of
managing nested or shadow page tables, which we don't want.

For memory, we'd be better off giving an allocation of a quantity of frames,
not specific frames.  A process can pin up to X pages, for instance.  It can
also pick pages to be evicted when there's memory pressure.  There are already
similar ideas out there, both in POSIX and in ACPM.

Instead of mucking with faking multiple programs / entities within a cell,
just make more processes.  Otherwise, you'd have to export weird controls that
the kernel is doing anyway (and can do better!), and have complicated middle
layers.

5.2: Multiple "Things" in a "partition"
-------------------------------
In the process-world, the kernel can make a distinction between different
entities that are using a block of resources.  Yes, "you" can still do
whatever you want with your resources.  But the kernel directly supports
useful controls that you want.
- Multiple protection domains are no problem.  They are just multiple
processes.  Resource allocation is a separate topic.
- Processes can control one another, based on a rational set of rules.  Even
if you have just cells, we still need them to be able to control one another
(it's a sysadmin thing).

"What happens in a cell, stays in a cell."  What does this really mean?  If
it's about resource allocation and passing of resources around, we can do that
with process groups.  If it's about the kernel not caring about what code runs
inside a protection domain, a process provides that.  If it's about a "parent"
program trying to control/kill/whatever a "child" (even if it's within a cell,
in the cell model), you *want* the kernel to be involved.  The kernel is the
one that can do protection between entities.

5.3: Other Things
-------------------------------
Let the kernel do what it's made to do, and is in the best position to do:
manage protection and low-level resources.

Both processes and partitions "have" resources.  They are at different levels
in the system.  A process actually gets to use the resources.  A partition is
a collection of resources allocated to one or more processes.

In response to this:

On 2009-09-15 at 22:33 John Kubiatowicz wrote:
> John Shalf wrote:
> >
> > Anyhow, Barret is asking that resource requirements attributes be
> > assigned on a process basis rather than partition basis.  We need
> > to justify why gang scheduling of a partition and resource
> > management should be linked.

I want a process to be aware of its specific resources, as well as the other
members of its partition.  An individual process (which is gang-scheduled in
many-core mode) has a specific list of resources.  It's just that the overall
'partition of system resources' is separate from the list of specific
resources of a process, simply because there can be many processes under the
same partition (collection of resources allocated).

> >
> Simplicity!
>
> Yes, we can allow lots of options, but at the end of the day, the
> simplest model that does what we need is likely the best. I don't
> want us to hack together a frankenscheduler.

My view is also simple in the case of one address space/process per
'partition.'  Extending it to multiple address spaces is simply asking that
resources be shared between processes, but without design details that I
imagine will be brutally complicated in the Cell model.


Part 6: Use Cases
===============================
6.1: Matrix Multiply / Trusting Many-core app
-------------------------------
The process is created by something (bash, for instance).  Its parent makes
it runnable.  The process requests a bunch of cores and RAM.  The scheduler
decides to give it a certain amount of resources, which creates its partition
(aka, the chunk of resources granted to its process group, of which it is the
only member).  The sysadmin can tweak this allocation via procfs.

The process runs on its cores in its many-core mode.  It is gang scheduled,
and knows how many cores there are.  When the kernel starts the process on
its extra cores, it passes control to a known spot in code (the ELF entry
point), with the virtual core id passed as a parameter.

The code runs from a single binary image, eventually with shared
object/library support.  Its view of memory is a virtual address space, but
it also can see its own page tables to see which pages are really resident
(similar to POSIX's mincore()).

When it comes time to lose a core, or be completely preempted, the process is
notified by the OS running a handler of the process's choosing (in userspace).
The process can choose what to do (pick a core to yield, prepare to be
preempted, etc).

To deal with memory, the process is notified when it page faults, and keeps
its core.  The process can pin pages in memory.  If there is memory pressure,
the process can tell the kernel which pages to unmap.

This is the simple case.

6.2: Browser
-------------------------------
In this case, a process wants to create multiple protection domains that share
the same pool of resources.  Or rather, each with its own allocated resources.

The browser process is created, as above.  It creates, but does not run, its
untrusted children.  The kernel will have a variety of ways a process can
"mess with" a process it controls.  So for this untrusted child, the parent
can pass (for example) a file descriptor of what to render, and "sandbox" that
process (only allow a whitelist of syscalls, e.g. it can only read and write
descriptors it has).  You can't do this easily in the cell model.

The parent can also set up a shared memory mapping / channel with the child.

For resources, the parent can put the child in a subdirectory/subpartition
and give a portion of its resources to that subpartition.  The scheduler will
ensure that both the parent and the child are run at the same time, and will
give the child process the resources specified (cores, RAM, etc).

After this setup, the parent will then make the child "runnable".  This is why
we want to separate the creation from the runnability of a process, which we
can't do with the fork/exec model.

The parent can later kill the child if it wants, reallocate the resources in
the partition (perhaps to another process rendering a more important page),
preempt that process, whatever.

6.3: SMP Virtual Machines
-------------------------------
The main issue (regardless of paravirt or full virt) is that what's running
on the cores may or may not trust one another.  One solution is to run each
VM-core in its own process (like Linux's KVM, which uses N tasks (part of
one process) for an N-way SMP VM).  The processes set up the appropriate
shared memory mapping between themselves early on.  Another approach would be
to allow a many-cored process to install specific address spaces on each
core, and interpose on syscalls, privileged instructions, and page faults.
This sounds very much like the Cell approach, which may be fine for a VM, but
not for the general case of a process.

Or with a paravirtualized SMP guest, you could (similar to the L4Linux way)
make any Guest OS processes actual processes in our OS.  The resource
allocation to the Guest OS partition would be managed by the parent process of
the group (which would be running the Guest OS kernel).  We still need to play
tricks with syscall redirection.

For full virtualization, we'd need to make use of hardware virtualization
instructions.  Dealing with the VMEXITs, emulation, and other things is a real
pain, but already done.  The long-range plan was to wait until the
http://v3vee.org/ project supported Intel's instructions and eventually
incorporate that.

All of these ways involve subtle and not-so-subtle difficulties.  The
Cell-as-OS mode will have to deal with them for the common case, which seems
brutal.  And rather unnecessary.