| processes.txt | 
 | Barret Rhoden | 
 |  | 
 | All things processes!  This explains processes from a high level, especially | 
 | focusing on the user-kernel boundary and transitions to the many-core state, | 
 | which is the way in which parallel processes run.  This doesn't discuss deep | 
 | details of the ROS kernel's process code. | 
 |  | 
 | This is motivated by two things: kernel scalability and direct support for | 
 | parallel applications. | 
 |  | 
 | Part 1: Overview | 
 | Part 2: How They Work | 
 | Part 3: Resource Requests | 
 | Part 4: Preemption and Notification | 
 | Part 5: Old Arguments (mostly for archival purposes) | 
 | Part 6: Parlab app use cases | 
 |  | 
 | Revision History: | 
 | 2009-10-30 - Initial version | 
 | 2010-03-04 - Preemption/Notification, changed to many-core processes | 
 |  | 
 | Part 1: World View of Processes | 
 | ================================== | 
 | A process is the lowest level of control, protection, and organization in the | 
 | kernel. | 
 |  | 
 | 1.1: What's a process? | 
 | ------------------------------- | 
 | Features: | 
 | - They are an executing instance of a program.  A program can load multiple | 
 |   other chunks of code and run them (libraries), but they are written to work | 
 |   with each other, within the same address space, and are in essence one | 
 |   entity. | 
 | - They have one address space/ protection domain.   | 
 | - They run in Ring 3 / Usermode. | 
 | - They can interact with each other, subject to permissions enforced by the | 
 |   kernel. | 
 | - They can make requests from the kernel, for things like resource guarantees. | 
 |   They have a list of resources that are given/leased to them. | 
 |  | 
 | None of these are new.  Here's what's new: | 
 | - They can run in a many-core mode, where their cores run at the same time, | 
 |   and they are aware of changes to these conditions (page faults, | 
 |   preemptions).  They can still request more resources (cores, memory, | 
 |   whatever). | 
 | - Every core in a many-core process (MCP) is *not* backed by a kernel | 
 |   thread/kernel stack, unlike with Linux tasks. | 
 | 	- There are *no* per-core run-queues in the kernel that decide for | 
 | 	  themselves which kernel thread to run. | 
 | - They are not fork()/execed().  They are created(), and then later made | 
 |   runnable.  This allows the controlling process (parent) to do whatever it | 
 |   wants: pass file descriptors, give resources, whatever. | 
 |  | 
 | These changes are directly motivated by what is wrong with current SMP | 
 | operating systems as we move towards many-core: direct (first class) support | 
 | for truly parallel processes, kernel scalability, and an ability of a process | 
 | to see through classic abstractions (the virtual processor) to understand (and | 
 | make requests about) the underlying state of the machine. | 
 |  | 
 | 1.2: What's a partition? | 
 | ------------------------------- | 
 | So a process can make resource requests, but some part of the system needs to | 
 | decide what to grant, when to grant it, etc.  This goes by several names: | 
 | scheduler / resource allocator / resource manager.  The scheduler simply says | 
 | when you get some resources, then calls functions from lower parts of the | 
 | kernel to make it happen. | 
 |  | 
 | This is where the partitioning of resources comes in.  In the simple case (one | 
 | process per partitioned block of resources), the scheduler just finds a slot | 
 | and runs the process, giving it its resources.   | 
 |  | 
 | A big distinction is that the *partitioning* of resources only makes sense | 
 | from the scheduler on up in the stack (towards userspace).  The lower levels | 
 | of the kernel know about resources that are granted to a process.  The | 
 | partitioning is about the accounting of resources and an interface for | 
 | adjusting their allocation.  It is a method for telling the 'scheduler' how | 
 | you want resources to be granted to processes. | 
 |  | 
 | A possible interface for this is procfs, which has a nice hierarchy. | 
 | Processes can be grouped together, and resources can be granted to them.  Who | 
 | does this?  A process can create its own directory entry (a partition), and | 
 | move anyone it controls (parent of, though that's not necessary) into its | 
 | partition or a sub-partition.  Likewise, a sysadmin/user can simply move PIDs | 
 | around in the tree, creating partitions consisting of processes completely | 
 | unaware of each other. | 
 |  | 
 | Now you can say things like "give 25% of the system's resources to apache and | 
 | mysql".  They don't need to know about each other.  If you want finer-grained | 
 | control, you can create subdirectories (subpartitions), and give resources on | 
 | a per-process basis.  This is back to the simple case of one process for one | 
 | (sub)partition. | 
 |  | 
 | This is all influenced by Linux's cgroups (process control groups). | 
 | http://www.mjmwired.net/kernel/Documentation/cgroups.txt. They group processes | 
 | together, and allow subsystems to attach meaning to those groups. | 
 |  | 
 | Ultimately, I view partitioning as something that tells the kernel how to | 
 | grant resources.  It's an abstraction presented to userspace and higher levels | 
 | of the kernel.  The specifics still need to be worked out, but by separating | 
 | them from the process abstraction, we can work it out and try a variety of | 
 | approaches. | 
 |  | 
 | The actual granting of resources and enforcement is done by the lower levels | 
 | of the kernel (or by hardware, depending on future architectural changes). | 
 |  | 
 | Part 2: How They Work | 
 | =============================== | 
 | 2.1: States | 
 | ------------------------------- | 
 | PROC_CREATED | 
 | PROC_RUNNABLE_S | 
 | PROC_RUNNING_S | 
 | PROC_WAITING | 
 | PROC_DYING | 
 | PROC_RUNNABLE_M | 
 | PROC_RUNNING_M | 
 |  | 
 | Difference between the _M and the _S states: | 
 | - _S : legacy process mode.  There is no need for a second-level scheduler, and | 
 |   the code running is analogous to a user-level thread. | 
 | - RUNNING_M implies *guaranteed* core(s).  You can be a single core in the | 
 |   RUNNING_M state.  The guarantee is subject to time slicing, but when you | 
 |   run, you get all of your cores. | 
 | - The time slicing is at a coarser granularity for _M states.  This means that | 
 |   when you run an _S on a core, it should be interrupted/time sliced more | 
 |   often, which also means the core should be classified differently for a | 
 |   while.  Possibly even using its local APIC timer. | 
 | - A process in an _M state will be informed about changes to its state, e.g., | 
 |   will have a handler run in the event of a page fault | 
 |  | 
 | For more details, check out kern/inc/process.h.  For valid transitions between | 
 | these, check out kern/src/process.c's proc_set_state(). | 
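As a rough sketch (the real definitions live in kern/inc/process.h, and the
transition table below is illustrative, not the kernel's actual rules), the
states and a proc_set_state()-style sanity check might look like:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch of the process states described above. */
enum proc_state {
	PROC_CREATED,
	PROC_RUNNABLE_S,
	PROC_RUNNING_S,
	PROC_WAITING,
	PROC_DYING,
	PROC_RUNNABLE_M,
	PROC_RUNNING_M,
};

/* A few of the transitions proc_set_state() might sanity-check.  This table
 * is a guess for illustration; see kern/src/process.c for the real one. */
static bool valid_transition(enum proc_state from, enum proc_state to)
{
	switch (from) {
	case PROC_CREATED:    return to == PROC_RUNNABLE_S || to == PROC_DYING;
	case PROC_RUNNABLE_S: return to == PROC_RUNNING_S || to == PROC_DYING;
	case PROC_RUNNING_S:  return to == PROC_RUNNABLE_S ||
	                             to == PROC_RUNNABLE_M ||
	                             to == PROC_WAITING || to == PROC_DYING;
	case PROC_RUNNABLE_M: return to == PROC_RUNNING_M || to == PROC_DYING;
	case PROC_RUNNING_M:  return to == PROC_RUNNABLE_M ||
	                             to == PROC_RUNNING_S || to == PROC_DYING;
	default:              return false;
	}
}
```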
 |  | 
 | 2.2: Creation and Running | 
 | ------------------------------- | 
 | Unlike the fork-exec model, processes are created, and then explicitly made | 
 | runnable.  In the time between creation and running, the parent (or another | 
 | controlling process) can do whatever it wants with the child, such as passing | 
 | specific file descriptors or mapping shared memory regions (which can be used | 
 | to pass arguments). | 
 |  | 
 | New processes are not a copy-on-write version of the parent's address space. | 
 | Due to our changes in the threading model, we no longer need (or want) this | 
 | behavior left over from the fork-exec model. | 
 |  | 
 | By splitting the creation from the running and by explicitly sharing state | 
 | between processes (like inherited file descriptors), we avoid a lot of | 
 | concurrency and security issues. | 
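A toy model of the create-then-run sequence; the syscall names here
(sys_proc_create, sys_pass_fd, sys_proc_run) and the state values are
illustrative stand-ins, not the actual ROS interface:

```c
#include <assert.h>

/* Toy model: the child sits in a CREATED state until explicitly run, so the
 * parent can set up its state (fds, shared memory) first. */
enum { CREATED, RUNNABLE };
int child_state = CREATED;
int child_fds[8];
int nr_child_fds;

int sys_proc_create(void)
{
	child_state = CREATED;	/* created, but not yet runnable */
	return 1;		/* toy pid */
}

void sys_pass_fd(int pid, int fd)
{
	(void)pid;
	/* Setup only makes sense before the child runs. */
	assert(child_state == CREATED);
	child_fds[nr_child_fds++] = fd;
}

void sys_proc_run(int pid)
{
	(void)pid;
	child_state = RUNNABLE;	/* now the scheduler may run it */
}
```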
 |  | 
 | 2.3: Vcoreid vs Pcoreid | 
 | ------------------------------- | 
 | The vcoreid is a virtual cpu number.  Its purpose is to provide an easy way | 
 | for the kernel and userspace to talk about the same core.  pcoreid (physical) | 
 | would also work.  The vcoreid makes things a little easier, such as when a | 
 | process wants to refer to one of its other cores (not the calling core).  It | 
 | also makes the event notification mechanisms easier to specify and maintain. | 
 |  | 
 | Processes that care about locality should check what their pcoreid is.  This | 
 | is currently done via sys_getcpuid().  The name will probably change. | 
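To illustrate the mapping (the real kernel structures differ), a vcoremap is
just a per-process table from vcoreid to pcoreid:

```c
#include <assert.h>

/* Sketch of a per-process vcoremap: vcoreid -> pcoreid.  -1 marks an
 * unmapped vcore.  Here the process holds two vcores, backed by physical
 * cores 3 and 7. */
#define MAX_VCORES 8
int vcoremap[MAX_VCORES] = { 3, 7, -1, -1, -1, -1, -1, -1 };

/* A process referring to "its vcore1" resolves to a physical core: */
int vcore_to_pcore(int vcoreid)
{
	assert(0 <= vcoreid && vcoreid < MAX_VCORES);
	return vcoremap[vcoreid];
}
```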
 |  | 
 | 2.4: Transitioning to and from states | 
 | ------------------------------- | 
 | 2.4.1: To go from _S to _M, a process requests cores. | 
 | -------------- | 
 | A resource request from 0 to 1 or more causes a transition from _S to _M.  The | 
 | calling context is saved in the uthread slot (uthread_ctx) in vcore0's | 
 | preemption data (in procdata).  The second level scheduler needs to be able to | 
 | restart the context when vcore0 starts up.  To do this, it will need to save the | 
 | TLS/TCB descriptor and the floating point/silly state (if applicable) in the | 
 | user-thread control block, and do whatever is needed to signal vcore0 to run the | 
 | _S context when it starts up.  One way would be to mark vcore0's "active thread" | 
 | variable to point to the _S thread.  When vcore0 starts up at | 
 | _start/vcore_entry() (like all vcores), it will see a thread was running there | 
 | and restart it.  The kernel will migrate the _S thread's silly state (FP) to the | 
 | new pcore, so that it looks like the process was simply running the _S thread | 
 | and got notified.  Odds are, it will want to just restart that thread, but the | 
 | kernel won't assume that (hence the notification). | 
 |  | 
 | In general, all cores (and all subsequently allocated cores) start at the elf | 
 | entry point, with vcoreid in eax or a suitable arch-specific manner.  There is | 
 | also a syscall to get the vcoreid, but this will save an extra trap at vcore | 
 | start time. | 
 |  | 
 | Future proc_runs(), like from RUNNABLE_M to RUNNING_M, start all cores at the | 
 | entry point, including vcore0.  The saving of a _S context to vcore0's | 
 | uthread_ctx only happens on the transition from _S to _M (which the process | 
 | needs to be aware of for a variety of reasons).  This also means that userspace | 
 | needs to handle vcore0 coming up at the entry point again (and not starting the | 
 | program over).  This is currently done in sysdeps-ros/start.c, via the static | 
 | variable init.  Note there are some tricky things involving dynamically linked | 
 | programs, but it all works currently. | 
 |  | 
 | When coming in to the entry point, whether as the result of a startcore or a | 
 | notification, the kernel will set the stack pointer to whatever is requested | 
 | by userspace in procdata.  A process should allocate stacks of whatever size | 
 | it wants for its vcores when it is in _S mode, and write these locations to | 
 | procdata.  These stacks are the transition stacks (in Lithe terms) that are | 
 | used as jumping-off points for future function calls.  These stacks need to be | 
 | used in a continuation-passing style, and each time they are used, they start | 
 | from the top. | 
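The vcore0 startup path above can be sketched as follows.  All names
(current_uthread, vcore_entry) are illustrative, and the toy flags just
record which path was taken:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_VCORES 4
/* The 2LS's per-vcore "active thread" variable; vcore0 gets this pointed at
 * the saved _S context before the _S -> _M switch. */
void *current_uthread[MAX_VCORES];
int restarted[MAX_VCORES];	/* toy: did we restart a saved thread? */
int scheduled[MAX_VCORES];	/* toy: did we fall into the 2LS? */

/* Every vcore enters here, at the top of its transition stack
 * (continuation-passing style: never return into old frames). */
void vcore_entry(int vcoreid)
{
	if (current_uthread[vcoreid]) {
		/* vcore0 after _S -> _M lands here: restore TLS, FP state,
		 * and the context saved in the uthread slot. */
		restarted[vcoreid] = 1;
		return;
	}
	/* Fresh vcore: the second-level scheduler decides what to run. */
	scheduled[vcoreid] = 1;
}
```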
 |  | 
 | 2.4.2: To go from _M to _S, a process requests 0 cores | 
 | -------------- | 
 | The caller becomes the new _S context.  Everyone else gets trashed | 
 | (abandon_core()).  Their stacks are still allocated and it is up to userspace | 
 | to deal with this.  In general, they will regrab their transition stacks when | 
 | they come back up.  Their other stacks and whatnot (like TBB threads) need to | 
 | be dealt with. | 
 |  | 
 | When the caller next switches to _M, that context (including its stack) | 
 | maintains its old vcore identity.  If vcore3 causes the switch to _S mode, it | 
 | ought to remain vcore3 (lots of things get broken otherwise). | 
 | As of March 2010, the code does not reflect this.  Don't rely on anything in | 
 | this section for the time being. | 
 |  | 
 | 2.4.3: Requesting more cores while in _M | 
 | -------------- | 
 | Any core can request more cores and adjust the resource allocation in any way. | 
 | These new cores come up just like the original new cores in the transition | 
 | from _S to _M: at the entry point. | 
 |  | 
 | 2.4.4: Yielding | 
 | -------------- | 
 | sys_yield()/proc_yield() will give up the calling core, and may or may not | 
 | adjust the desired number of cores, subject to its parameters.  Yield performs | 
 | two tasks, both of which result in giving up the core.  One is for not wanting | 
 | the core anymore.  The other is in response to a preemption.  Yield may not be | 
 | called remotely (ARSC). | 
 |  | 
 | In _S mode, it will transition from RUNNING_S to RUNNABLE_S.  The context is | 
 | saved in scp_ctx. | 
 |  | 
 | In _M mode, this yields the calling core.  A yield will *not* transition from _M | 
 | to _S.  The kernel will rip it out of your vcore list.  A process can yield its | 
 | cores in any order.  The kernel will "fill in the holes of the vcoremap" for any | 
 | future new cores requested (e.g., proc A has 4 vcores, yields vcore2, and then | 
 | asks for another vcore.  The new one will be vcore2).  When any core starts in | 
 | _M mode, even after a yield, it will come back at the vcore_entry()/_start point. | 
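The hole-filling behavior can be sketched as a lowest-free-slot scan over the
vcoremap (toy version; -1 marks a yielded slot, matching the example above
where vcore2 was yielded out of four):

```c
#include <assert.h>

/* Toy vcoremap for the example: vcores 0, 1, and 3 are mapped to physical
 * cores; vcore2 was yielded, leaving a hole. */
#define MAX_VCORES 8
int vcoremap[MAX_VCORES] = { 10, 11, -1, 13, -1, -1, -1, -1 };

/* "Fill in the holes": the next granted core takes the lowest free slot. */
int next_free_vcoreid(void)
{
	for (int i = 0; i < MAX_VCORES; i++)
		if (vcoremap[i] == -1)
			return i;
	return -1;	/* no free slots */
}
```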
 |  | 
 | Yield will normally adjust your desired amount of vcores to the amount after the | 
 | calling core is taken.  This is the way a process gives its cores back. | 
 |  | 
 | Yield can also be used to say the process is just giving up the core in response | 
 | to a pending preemption, but actually wants the core and does not want resource | 
 | requests to be readjusted.  For example, in the event of a preemption | 
 | notification, a process may yield (ought to!) so that the kernel does not need | 
 | to waste effort with full preemption.  This is done by passing in a bool | 
 | (being_nice), which signals the kernel that it is in response to a preemption. | 
 | The kernel will not readjust the amt_wanted, and if there is no preemption | 
 | pending, the kernel will ignore the yield. | 
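The two flavors of yield might be sketched like this (names and fields are
illustrative, not the kernel's actual code):

```c
#include <assert.h>
#include <stdbool.h>

int amt_wanted = 4;	/* cores this process asked for */
int nr_vcores = 4;	/* cores it currently holds */
bool preempt_pending;	/* kernel set this before preempting */

/* Returns true if the core was actually given up. */
bool proc_yield(bool being_nice)
{
	if (being_nice) {
		/* Only give up the core if a preemption really is pending,
		 * and never touch amt_wanted. */
		if (!preempt_pending)
			return false;	/* stale yield: ignore it */
		nr_vcores--;
		return true;
	}
	/* Normal yield: the process no longer wants this core. */
	nr_vcores--;
	amt_wanted = nr_vcores;
	return true;
}
```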
 |  | 
 | There may be an m_yield(), which will yield all or some of the cores of an MCP, | 
 | remotely.  This is discussed further down a bit.  It's not clear what exactly | 
 | its purpose would be. | 
 |  | 
 | We also haven't addressed other reasons to yield, or more specifically to wait, | 
 | such as for an interrupt or an event of some sort. | 
 |  | 
 | 2.4.5: Others | 
 | -------------- | 
 | There are other transitions, mostly self-explanatory.  We don't currently use | 
 | any WAITING states, since we have nothing to block on yet.  DYING is a state | 
 | when the kernel is trying to kill your process, which can take a little while | 
 | to clean up. | 
 |  | 
 | Part 3: Resource Requests | 
 | =============================== | 
 | A process can ask for resources from the kernel.  The kernel either grants | 
 | these requests or not, subject to QoS guarantees, or other scheduler-related | 
 | criteria. | 
 |  | 
 | A process requests resources, currently via sys_resource_req.  The form of a | 
 | request is to tell the kernel how much of a resource it wants.  Currently, | 
 | this is the amt_wanted.  We'll also have a minimum amount wanted, which tells | 
 | the scheduler not to run the process until the minimum amount of resources are | 
 | available. | 
 |  | 
 | How the kernel actually grants resources is resource-specific.  In general, | 
 | there are functions like proc_give_cores() (which gives certain cores to a | 
 | process) that actually do the allocation, as well as adjust the amt_granted | 
 | for that resource. | 
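A sketch of the per-resource accounting: amt_wanted and amt_granted follow
the text, while the minimum field's name and the grant policy here are made
up for illustration:

```c
#include <assert.h>

/* Per-resource request/grant bookkeeping, per the text above. */
struct resource {
	int amt_wanted;		/* how much the process asks for */
	int amt_wanted_min;	/* don't run the process with less */
	int amt_granted;	/* what the kernel actually allocated */
};

/* Toy scheduler decision: grant up to what's available, but only if the
 * minimum can be met.  Returns the amount granted. */
int try_grant(struct resource *res, int available)
{
	if (available < res->amt_wanted_min)
		return 0;	/* leave the process unscheduled */
	res->amt_granted = available < res->amt_wanted ? available
	                                               : res->amt_wanted;
	return res->amt_granted;
}
```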
 |  | 
 | For expressing QoS guarantees, we'll probably use something like procfs (as | 
 | mentioned above) to explicitly tell the scheduler/resource manager what the | 
 | user/sysadmin wants.  An interface like this ought to be usable both by | 
 | programs as well as simple filesystem tools (cat, etc). | 
 |  | 
 | Guarantees exist regardless of whether or not the allocation has happened.  An | 
 | example of this is when a process may be guaranteed to use 8 cores, but | 
 | currently only needs 2.  Whenever it asks for up to 8 cores, it will get them. | 
 | The exact nature of the guarantee is TBD, but there will be some sort of | 
 | latency involved in the guarantee for systems that want to take advantage of | 
 | idle resources (compared to simply reserving and not allowing anyone else to | 
 | use them).  A latency of 0 would mean a process wants it instantly, which | 
 | probably means the resources ought to be already allocated to (and billed to) | 
 | that process. | 
 |  | 
 | Part 4: Preemption and Event Notification | 
 | =============================== | 
 | Preemption and Notification are tied together.  Preemption is when the kernel | 
 | takes a resource (specifically, cores).  There are two types: core_preempt() | 
 | (one core) and gang_preempt() (all cores).  Notification (discussed below) is | 
 | when the kernel informs a process of an event, usually referring to the act of | 
 | running a function on a core (active notification). | 
 |  | 
 | The rough plan for preemption is to notify beforehand, then take action if | 
 | userspace doesn't yield.  This is a notification a process can ignore, though | 
 | it is highly recommended to at least be aware of impending core_preempt() | 
 | events. | 
 |  | 
 | 4.1: Notification Basics | 
 | ------------------------------- | 
 | One of the philosophical goals of ROS is to expose information up to userspace | 
 | (and allow requests based on that information).  There will be a variety of | 
 | events in the system that processes will want to know about.  To handle this, | 
 | we'll eventually build something like the following. | 
 |  | 
 | All events will have a number, like an interrupt vector.  Each process will | 
 | have an event queue (per core, described below).  On most architectures, it | 
 | will be a simple producer-consumer ring buffer sitting in the "shared memory" | 
 | procdata region (shared between the kernel and userspace).  The kernel writes | 
 | a message into the buffer with the event number and some other helpful | 
 | information. | 
 |  | 
 | Additionally, the process may request to be actively notified of specific | 
 | events.  This is done by having the process write into an event vector table | 
 | (like an IDT) in procdata.  For each event, the process writes the vcoreid it | 
 | wants to be notified on. | 
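A minimal single-producer/single-consumer ring, as a sketch of the per-core
event queue in procdata (sizes and field names are made up for illustration):

```c
#include <assert.h>

/* One slot per message; kernel produces, process consumes. */
#define QSZ 16	/* power of two, so the indices wrap cleanly */
struct event_msg { int ev_num; long ev_arg; };

struct event_queue {
	struct event_msg msgs[QSZ];
	unsigned prod;	/* written only by the kernel */
	unsigned cons;	/* written only by the process */
};

/* Kernel side: write a message with the event number and extra info. */
int send_event(struct event_queue *q, int ev_num, long arg)
{
	if (q->prod - q->cons == QSZ)
		return -1;	/* full: fall back to the bitmask */
	q->msgs[q->prod % QSZ] = (struct event_msg){ ev_num, arg };
	q->prod++;
	return 0;
}

/* Process side: drain one message, if any. */
int recv_event(struct event_queue *q, struct event_msg *out)
{
	if (q->cons == q->prod)
		return -1;	/* empty */
	*out = q->msgs[q->cons % QSZ];
	q->cons++;
	return 0;
}
```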
 |  | 
 | 4.2: Notification Specifics | 
 | ------------------------------- | 
 | In procdata there is an array of per-vcore data, holding some | 
 | preempt/notification information and space for two trapframes: one for | 
 | notification and one for preemption. | 
 |  | 
 | 4.2.1: Overall | 
 | ----------------------------- | 
 | When a notification arrives to a process under normal circumstances, the | 
 | kernel places the previous running context in the notification trapframe, and | 
 | returns to userspace at the program entry point (the elf entry point) on the | 
 | transition stack.  If a process is already handling a notification on that | 
 | core, the kernel will not interrupt it.  It is the process's responsibility | 
 | to check for more notifications before returning to its normal work.  The | 
 | process must also unmask notifications (in procdata) before it returns to do | 
 | normal work.  Masked notifications are the signal to the kernel to not | 
 | bother sending IPIs, and if an IPI is sent before notifications are masked, | 
 | then the kernel will double-check this flag to make sure interrupts should | 
 | have arrived. | 
 |  | 
 | Notification unmasking is done by clearing the notif_disabled flag (similar to | 
 | turning interrupts on in hardware).  When a core starts up, this flag is on, | 
 | meaning that notifications are disabled by default.  It is the process's | 
 | responsibility to turn on notifications for a given vcore. | 
 |  | 
 | 4.2.2: Notif Event Details | 
 | ----------------------------- | 
 | When the process runs the handler, it is actually starting up at the same | 
 | location in code as it always does.  To determine if it was a notification or | 
 | not, simply check the queue and bitmask.  This has the added benefit of allowing | 
 | a process to notice notifications that it missed previously, or notifs it wanted | 
 | without active notification (IPI).  If we want to bypass this check by having a | 
 | magic register signal, we can add that later.  Additionally, the kernel will | 
 | mask notifications (much like an x86 interrupt gate).  It will also mask | 
 | notifications when starting a core with a fresh trapframe, since the process | 
 | will be executing on its transition stack.  The process must check its per-core | 
 | event queue to see why it was called, and deal with all of the events on the | 
 | queue.  In the case where the event queue overflows, the kernel will up a | 
 | counter so the process can at least be aware things are missed.  At the very | 
 | least, the process will see the notification marked in a bitmask. | 
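After draining the message queue, the entry-point code sweeps the bitmask so
marked-but-messageless (or overflowed) events are not lost.  A sketch, with
illustrative names:

```c
#include <assert.h>

#define NR_EVENTS 32
unsigned int ev_bitmask;	/* kernel sets bit ev_num on overflow/no-msg */
unsigned int ev_overflows;	/* kernel ups this when the queue overflows */
int handled[NR_EVENTS];		/* toy: count of handler invocations */

/* After draining the message queue, sweep the bitmask so nothing that was
 * only marked there gets lost. */
void handle_bitmask_events(void)
{
	for (int i = 0; i < NR_EVENTS; i++) {
		if (ev_bitmask & (1u << i)) {
			ev_bitmask &= ~(1u << i);
			handled[i]++;	/* run the handler for event i */
		}
	}
}
```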
 |  | 
 | These notification events include things such as: an IO is complete, a | 
 | preemption is pending to this core, the process just returned from a | 
 | preemption, there was a trap (divide by 0, page fault), and many other things. | 
 | We plan to allow this list to grow at runtime (a process can request new event | 
 | notification types).  These messages will often need some form of a timestamp, | 
 | especially ones that will expire in meaning (such as a preempt_pending). | 
 |  | 
 | Note that only one notification can be active at a time, including a fault. | 
 | This means that if a process page faults or something while notifications are | 
 | masked, the process will simply be killed.  It is up to the process to make | 
 | sure the appropriate pages are pinned, which it should do before entering _M | 
 | mode. | 
 |  | 
 | 4.2.3: Event Overflow and Non-Messages | 
 | ----------------------------- | 
 | For missed/overflowed events, and for events that do not need messages (they | 
 | have no parameters and multiple notifications are irrelevant), the kernel will | 
 | toggle that event's bit in a bitmask.  For the events that don't want messages, | 
 | we may have a flag that userspace sets, meaning they just want to know it | 
 | happened.  This might be too much of a pain, so we'll see.  For notification | 
 | events that overflowed the queue, the parameters will be lost, but hopefully the | 
 | application can sort it out.  Again, we'll see.  A specific notif_event should | 
 | not appear in both the event buffers and in the bitmask. | 
 |  | 
 | It does not make sense for all events to have messages.  For others, it does | 
 | not make sense to specify a different core on which to run the handler (e.g., | 
 | page faults).  The notification methods that the process expresses via | 
 | procdata are suggestions to the kernel.  When they don't make sense, they | 
 | will be ignored. | 
 | Some notifications might be unserviceable without messages.  A process needs to | 
 | have a fallback mechanism.  For example, they can read the vcoremap to see who | 
 | was lost, or they can restart a thread to cause it to page fault again. | 
 |  | 
 | Event overflow sucks - it leads to a bunch of complications.  Ultimately, what | 
 | we really want is a limitless amount of notification messages (per core), as | 
 | well as a limitless amount of notification types.  And we want these to be | 
 | relayed to userspace without trapping into the kernel.  | 
 |  | 
 | We could do this if we had a way to dynamically manage memory in procdata, with | 
 | a distrusted process on one side of the relationship.  We could imagine growing | 
 | procdata dynamically (we plan to, mostly to grow the preempt_data struct as we | 
 | request more vcores), and then run some sort of heap manager / malloc.  Things | 
 | get very tricky since the kernel should never follow pointers that userspace can | 
 | touch.  Additionally, whatever memory management we use becomes a part of the | 
 | kernel interface.   | 
 |  | 
 | Even if we had that, dynamic notification *types* is tricky - they are | 
 | identified by a number, not by a specific (list) element. | 
 |  | 
 | For now, this all seems like an unnecessary pain in the ass.  We might adjust it | 
 | in the future if we come up with clean, clever ways to deal with the problem, | 
 | which we aren't even sure is a problem yet. | 
 |  | 
 | 4.2.4: How to Use and Leave a Transition Stack | 
 | ----------------------------- | 
 | We considered having the kernel be aware of a process's transition stacks and | 
 | sizes so that it can detect if a vcore is in a notification handler based on | 
 | the stack pointer in the trapframe when a trap or interrupt fires.  While | 
 | cool, the flag for notif_disabled is much easier and just as capable. | 
 | Userspace needs to be aware of various races, and only enable notifications | 
 | when it is ready to have its transition stack clobbered.  This means that when | 
 | switching from big user-thread to user-thread, the process should temporarily | 
 | disable notifications and reenable them before starting the new thread fully. | 
 | This is analogous to having a kernel that disables interrupts while in process | 
 | context. | 
 |  | 
 | A process can fake not being on its transition stack, and can even unmap its | 
 | stack.  At worst, a vcore could recursively page fault (the kernel does not | 
 | know it is in a handler, if it keeps enabling notifs before faulting), and | 
 | that would continue until the core is forcibly preempted.  This is not an | 
 | issue for the kernel. | 
 |  | 
 | When a process wants to use its transition stack, it ought to check | 
 | preempt_pending, mask notifications, jump to its transition stack, do its work | 
 | (e.g. process notifications, check for new notifications, schedule a new | 
 | thread) periodically checking for a pending preemption, and making sure the | 
 | notification queue/list is empty before moving back to real code.  Then it | 
 | should jump back to a real stack, unmask notifications, and jump to the newly | 
 | scheduled thread. | 
 |  | 
 | This can be really tricky.  When userspace is changing threads, it will need to | 
 | unmask notifs as well as jump to the new thread.  There is a slight race here, | 
 | but it is okay.  The race is that an IPI can arrive after notifs are unmasked, | 
 | but before returning to the real user thread.  Then the code will think the | 
 | uthread_ctx represents the new user thread, even though it hasn't started (and | 
 | the PC is wrong).  The trick is to make sure that all state required to start | 
 | the new thread, as well as future instructions, are all saved within the "stuff" | 
 | that gets saved in the uthread_ctx.  When these threading packages change | 
 | contexts, they ought to push the PC on the stack of the new thread, (then enable | 
 | notifs) and then execute a return.  If an IPI arrives before the "function | 
 | return", then when that context gets restarted, it will run the "return" with | 
 | the appropriate value on the stack still. | 
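The trick can be modeled in a few lines: with the PC pushed before notifs are
enabled, a context snapshot taken at any later point still restarts the new
thread correctly (toy model, not real context-switch code):

```c
#include <assert.h>
#include <stdbool.h>

bool notif_disabled = true;
long stack[64];
int sp = 64;	/* toy stack, grows down */

/* Switching to a new thread, per the text: all state needed to finish the
 * switch lives on the new thread's stack before notifs come back on. */
void switch_to(long new_thread_pc)
{
	stack[--sp] = new_thread_pc;	/* 1. push the PC on the new stack */
	notif_disabled = false;		/* 2. enable notifications */
	/* 3. "ret" happens next.  If an IPI snapshots the context here,
	 * the saved context still has the PC on the stack, so restarting
	 * it runs the "return" with the right value. */
}

long do_ret(void)
{
	return stack[sp++];	/* pop the PC, as "ret" would */
}
```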
 |  | 
 | There is a further complication.  The kernel can send an IPI that the process | 
 | wanted, but the vcore did not get truly interrupted since its notifs were | 
 | disabled.  There is a race between checking the queue/bitmask and then enabling | 
 | notifications.  The way we deal with it is that the kernel posts the | 
 | message/bit, then sets notif_pending.  Then it sends the IPI, which may or may | 
 | not be received (based on notif_disabled).  (Actually, the kernel only ought to | 
 | send the IPI if notif_pending was 0 (atomically) and notif_disabled is 0).  When | 
 | leaving the transition stack, userspace should clear notif_pending, then | 
 | check the queue, do whatever, and then try to pop the tf.  When popping the tf, | 
 | after enabling notifications, check notif_pending.  If it is still clear, return | 
 | without fear of missing a notif.  If it is not clear, it needs to manually | 
 | notify itself (sys_self_notify) so that it can process the notification that it | 
 | missed and for which it wanted to receive an IPI.  Before it does this, it needs | 
 | to clear notif_pending, so the kernel will send it an IPI.  These last parts are | 
 | handled in pop_user_ctx(). | 
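The handshake can be sketched in two phases (names mirror the flags above;
the logic is simplified and single-threaded, so the race is simulated by
posting an event between the phases):

```c
#include <assert.h>
#include <stdbool.h>

bool notif_pending;
bool notif_disabled = true;
int self_notified;	/* toy stand-in for sys_self_notify() */

/* Leaving the transition stack: clear notif_pending, then drain the
 * queue/bitmask (elided here). */
void pop_user_ctx_begin(void)
{
	notif_disabled = true;
	notif_pending = false;
	/* ... check the queue/bitmask, handle everything found ... */
}

/* Popping the tf: reenable notifs, then re-check notif_pending. */
bool pop_user_ctx_finish(void)
{
	notif_disabled = false;
	if (notif_pending) {
		/* An event landed in the window: clear the flag and notify
		 * ourselves so the missed event gets processed. */
		notif_pending = false;
		self_notified++;
		return false;
	}
	return true;	/* safe: no notification was missed */
}

/* Kernel side: post the message/bit, then set notif_pending (the IPI-send
 * decision is elided). */
void kernel_post_event(void)
{
	notif_pending = true;
}
```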
 |  | 
 | 4.3: Preemption Specifics | 
 | ------------------------------- | 
 | There's an issue with a preempted vcore getting restarted while a remote core | 
 | tries to restart that context.  The two sides resolve this fight with a | 
 | variety of VC flags (VC_UTHREAD_STEALING).  Check out handle_preempt() in | 
 | uthread.c. | 
 | 4.4: Other trickiness | 
 | ------------------------------- | 
 | Take all of these with a grain of salt - it's quite old. | 
 |  | 
 | 4.4.1: Preemption -> deadlock | 
 | ------------------------------- | 
 | One issue is that a context can be holding a lock that is necessary for the | 
 | userspace scheduler to manage preempted threads, and this context can be | 
 | preempted.  This would deadlock the scheduler.  To assist a process from | 
 | locking itself up, the kernel will toggle a preempt_pending flag in | 
 | procdata for that vcore before sending the actual preemption.  Whenever the | 
 | scheduler is grabbing one of these critical spinlocks, it needs to check that | 
 | flag first, and yield if a preemption is coming in. | 
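The check-before-locking discipline might look like this (illustrative names;
the real flow would call proc_yield with being_nice set instead of bumping a
counter):

```c
#include <assert.h>
#include <stdbool.h>

bool preempt_pending;	/* kernel sets this in procdata before preempting */
bool lock_held;
int yields;		/* toy: times we bailed out and yielded */

/* Before grabbing a scheduler-critical spinlock, check the flag; if a
 * preemption is coming, yield rather than risk being preempted while
 * holding the lock (which would deadlock the 2LS).  Returns whether the
 * lock was taken. */
bool critical_spin_lock(void)
{
	if (preempt_pending) {
		yields++;	/* proc_yield(true) in the real flow */
		return false;	/* didn't take the lock */
	}
	lock_held = true;
	return true;
}
```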
 |  | 
 | Another option we may implement is for the process to be able to signal to the | 
 | kernel that it is in one of these ultra-critical sections by writing a magic | 
 | value to a specific register in the trapframe.  If the kernel sees this, it | 
 | will allow the process to run for a little longer.  The issue with this is | 
 | that the kernel would need to assume processes will always do this (malicious | 
 | ones will) and add this extra wait time to the worst case preemption time. | 
 |  | 
 | Finally, a scheduler could try to use non-blocking synchronization (no | 
 | spinlocks), or one of our other long-term research synchronization methods to | 
 | avoid deadlock, though we realize this is a pain for userspace for now.  FWIW, | 
 | there are some OSs out there with only non-blocking synchronization (I think). | 
 |  | 
 | 4.4.2: Cascading and overflow | 
 | ------------------------------- | 
 | There used to be issues with cascading interrupts (when contexts are still | 
 | running handlers).  Imagine a pagefault, followed by preempting the handler. | 
 | It doesn't make sense to run the preempt context after the page fault. | 
 | Earlier designs had issues where it was hard for a vcore to determine the | 
 | order of events and unmixing preemption, notification, and faults.  We deal | 
 | with this by having separate slots for preemption and notification, and by | 
 | treating faults as another form of notification.  Faulting while handling a | 
 | notification just leads to death.  Perhaps there is a better way to do that. | 
 |  | 
 | Another thing we considered would be to have two stacks - transition for | 
 | notification and an exception stack for faults.  We'd also need a fault slot | 
 | for the faulting trapframe.  This begins to take up even more memory, and it | 
 | is not clear how to handle mixed faults and notifications.  If you fault while | 
 | on the notification slot, then fine.  But you could fault for other reasons, | 
 | and then receive a notification.  And then if you fault in that handler, we're | 
 | back to where we started - might as well just kill them. | 
 |  | 
 | Another issue was overload.  Consider if vcore0 is set up to receive all | 
 | events.  If events come in faster than it can process them, it will both nest | 
 | too deep and process out of order.  To handle this, we only notify once, and | 
 | will not send future active notifications / interrupts until the process | 
 | issues an "end of interrupt" (EOI) for that vcore.  This is modelled after | 
 | hardware interrupts (on x86, at least). | 
 |  | 
 | 4.4.3: Restarting a Preempted Notification | 
 | ------------------------------- | 
Nowadays, to restart a preempted notification, you just restart the vcore.
The kernel does this either when it gives the process more cores or when
userspace asks for it with a sys_change_vcore().
 |  | 
 | 4.4.4: Userspace Yield Races | 
 | ------------------------------- | 
 | Imagine a vcore realizes it is getting preempted soon, so it starts to yield. | 
 | However, it is too slow and doesn't make it into the kernel before a preempt | 
 | message takes over.  When that vcore is run again, it will continue where it | 
 | left off and yield its core.  The desired outcome is for yield to fail, since | 
 | the process doesn't really want to yield that core.  To sort this out, yield | 
will take a parameter saying that the yield is in response to a pending
preemption.  If that phase is already over (the core was preempted and
returned), the call will not yield and will simply return to userspace.
 |  | 
 | 4.4.5: Userspace m_yield | 
 | ------------------------------- | 
 | There are a variety of ways to implement an m_yield (yield the entire MCP). | 
We could have a "no niceness" yield - just immediately preempt, but there is a
danger of the low-level locking problem described in 4.4.1.  We could do the
usual delay game, though if userspace is requesting its own yield, arguably we
don't need to give warning.
 |  | 
 | Another approach would be to not have an explicit m_yield call.  Instead, we | 
 | can provide a notify_all call, where the notification sent to every vcore is | 
 | to yield.  I imagine we'll have a notify_all (or rather, flags to the notify | 
 | call) anyway, so we can do this for now. | 
 |  | 
 | The fastest way will probably be the no niceness way.  One way to make this | 
 | work would be for vcore0 to hold all of the low-level locks (from 4.4.1) and | 
 | manually unlock them when it wakes up.  Yikes! | 
 |  | 
 | 4.5: Random Other Stuff | 
 | ------------------------------- | 
 | Pre-Notification issues: how much time does userspace need to clean up and | 
 | yield?  How quickly does the kernel need the core back (for scheduling | 
 | reasons)? | 
 |  | 
 | Part 5: Old Arguments about Processes vs Partitions | 
 | =============================== | 
This is based on my interpretation of the cell model (which I formerly thought
was called a partition).
 |  | 
 | 5.1: Program vs OS | 
 | ------------------------------- | 
 | A big difference is what runs inside the object.  I think trying to support | 
OS-like functionality is a quick path to unnecessary layers and complexity,
especially for the common case.  This leads to discussions of physical memory
 | management, spawning new programs, virtualizing HW, shadow page tables, | 
 | exporting protection rings, etc. | 
 |  | 
 | This unnecessarily brings in the baggage and complexity of supporting VMs, | 
 | which are a special case.  Yes, we want processes to be able to use their | 
 | resources, but I'd rather approach this from the perspective of "what do they | 
 | need?" than "how can we make it look like a real machine."  Virtual machines | 
 | are cool, and paravirtualization influenced a lot of my ideas, but they have | 
 | their place and I don't think this is it. | 
 |  | 
 | For example, exporting direct control of physical pages is a bad idea.  I | 
 | wasn't clear if anyone was advocating this or not.  By exposing actual machine | 
 | physical frames, we lose our ability to do all sorts of things (like swapping, | 
 | for all practical uses, and other VM tricks).  If the cell/process thinks it | 
 | is manipulating physical pages, but really isn't, we're in the VM situation of | 
 | managing nested or shadow page tables, which we don't want. | 
 |  | 
For memory, we'd be better off giving an allocation of a quantity of frames, not
 | specific frames.  A process can pin up to X pages, for instance.  It can also | 
 | pick pages to be evicted when there's memory pressure.  There are already | 
 | similar ideas out there, both in POSIX and in ACPM. | 
 |  | 
Instead of mucking with faking multiple programs / entities within a cell,
 | just make more processes.  Otherwise, you'd have to export weird controls that | 
 | the kernel is doing anyway (and can do better!), and have complicated middle | 
 | layers. | 
 |  | 
 | 5.2: Multiple "Things" in a "partition" | 
 | ------------------------------- | 
 | In the process-world, the kernel can make a distinction between different | 
 | entities that are using a block of resources.  Yes, "you" can still do | 
 | whatever you want with your resources.  But the kernel directly supports | 
 | useful controls that you want.  | 
 | - Multiple protection domains are no problem.  They are just multiple | 
 |   processes.  Resource allocation is a separate topic. | 
 | - Processes can control one another, based on a rational set of rules.  Even | 
 |   if you have just cells, we still need them to be able to control one another | 
 |   (it's a sysadmin thing). | 
 |  | 
 | "What happens in a cell, stays in a cell."  What does this really mean?  If | 
 | it's about resource allocation and passing of resources around, we can do that | 
 | with process groups.  If it's about the kernel not caring about what code runs | 
 | inside a protection domain, a process provides that.  If it's about a "parent" | 
 | program trying to control/kill/whatever a "child" (even if it's within a cell, | 
 | in the cell model), you *want* the kernel to be involved.  The kernel is the | 
 | one that can do protection between entities. | 
 |  | 
 | 5.3: Other Things | 
 | ------------------------------- | 
 | Let the kernel do what it's made to do, and in the best position to do: manage | 
 | protection and low-level resources. | 
 |  | 
 | Both processes and partitions "have" resources.  They are at different levels | 
 | in the system.  A process actually gets to use the resources.  A partition is | 
 | a collection of resources allocated to one or more processes. | 
 |  | 
 | In response to this: | 
 |  | 
 | On 2009-09-15 at 22:33 John Kubiatowicz wrote: | 
 | > John Shalf wrote:   | 
 | > > | 
 | > > Anyhow, Barret is asking that resource requirements attributes be  | 
 | > > assigned on a process basis rather than partition basis.  We need | 
 | > > to justify why gang scheduling of a partition and resource | 
 | > > management should be linked.   | 
 |  | 
I want a process to be aware of its specific resources, as well as the other
members of its partition.  An individual process (which is gang-scheduled in
many-core mode) has a specific list of resources.  It's just that the overall
 | 'partition of system resources' is separate from the list of specific | 
 | resources of a process, simply because there can be many processes under the | 
 | same partition (collection of resources allocated). | 
 |  | 
 | > >   | 
 | > Simplicity! | 
 | >  | 
 | > Yes, we can allow lots of options, but at the end of the day, the  | 
 | > simplest model that does what we need is likely the best. I don't | 
 | > want us to hack together a frankenscheduler.   | 
 |  | 
 | My view is also simple in the case of one address space/process per | 
 | 'partition.'  Extending it to multiple address spaces is simply asking that | 
 | resources be shared between processes, but without design details that I | 
 | imagine will be brutally complicated in the Cell model. | 
 |  | 
 |  | 
 | Part 6: Use Cases | 
 | =============================== | 
 | 6.1: Matrix Multiply / Trusting Many-core app | 
 | ------------------------------- | 
The process is created by something (bash, for instance).  Its parent makes
it runnable.  The process requests a bunch of cores and RAM.  The scheduler
decides to give it a certain amount of resources, which creates its partition
(aka, the chunk of resources granted to its process group, of which it is the
only member).  The sysadmin can tweak this allocation via procfs.
 |  | 
The process runs on its cores in its many-core mode.  It is gang scheduled,
and knows how many cores there are.  When the kernel starts the process on
its extra cores, it passes control to a known spot in code (the ELF entry
point), with the virtual core id passed as a parameter.
 |  | 
 | The code runs from a single binary image, eventually with shared | 
object/library support.  Its view of memory is a virtual address space, but
it also can see its own page tables to see which pages are really resident
 | (similar to POSIX's mincore()). | 
 |  | 
 | When it comes time to lose a core, or be completely preempted, the process is | 
 | notified by the OS running a handler of the process's choosing (in userspace). | 
 | The process can choose what to do (pick a core to yield, prepare to be | 
 | preempted, etc). | 
 |  | 
 | To deal with memory, the process is notified when it page faults, and keeps | 
 | its core.  The process can pin pages in memory.  If there is memory pressure, | 
 | the process can tell the kernel which pages to unmap. | 
 |  | 
 | This is the simple case. | 
 |  | 
 | 6.2: Browser | 
 | ------------------------------- | 
In this case, a process wants to create multiple protection domains that share
the same pool of resources.  Or rather, each with its own allocated resources.
 |  | 
The browser process is created, as above.  It creates, but does not run, its
untrusted children.  The kernel will have a variety of ways a process can
"mess with" a process it controls.  So for this untrusted child, the parent
can pass (for example) a file descriptor of what to render, and "sandbox" that
process (only allow a whitelist of syscalls, e.g. it can only read and write
descriptors it already has).  You can't do this easily in the cell model.
 |  | 
 | The parent can also set up a shared memory mapping / channel with the child. | 
 |  | 
 | For resources, the parent can put the child in a subdirectory/ subpartition | 
 | and give a portion of its resources to that subpartition.  The scheduler will | 
 | ensure that both the parent and the child are run at the same time, and will | 
 | give the child process the resources specified.  (cores, RAM, etc). | 
 |  | 
 | After this setup, the parent will then make the child "runnable".  This is why | 
 | we want to separate the creation from the runnability of a process, which we | 
 | can't do with the fork/exec model. | 
 |  | 
 | The parent can later kill the child if it wants, reallocate the resources in | 
 | the partition (perhaps to another process rendering a more important page), | 
 | preempt that process, whatever. | 
 |  | 
 | 6.3: SMP Virtual Machines | 
 | ------------------------------- | 
The main issue (regardless of paravirt or full virt) is that what's running
on the cores may or may not trust one another.  One solution is to run each
VM-core in its own process (as Linux's KVM does, using N tasks (part of
one process) for an N-way SMP VM).  The processes set up the appropriate
 | shared memory mapping between themselves early on.  Another approach would be | 
 | to allow a many-cored process to install specific address spaces on each | 
 | core, and interpose on syscalls, privileged instructions, and page faults. | 
 | This sounds very much like the Cell approach, which may be fine for a VM, but | 
 | not for the general case of a process. | 
 |  | 
Or with a paravirtualized SMP guest, you could (similar to the L4Linux way)
 | make any Guest OS processes actual processes in our OS.  The resource | 
 | allocation to the Guest OS partition would be managed by the parent process of | 
 | the group (which would be running the Guest OS kernel).  We still need to play | 
 | tricks with syscall redirection. | 
 |  | 
 | For full virtualization, we'd need to make use of hardware virtualization | 
 | instructions. Dealing with the VMEXITs, emulation, and other things is a real | 
pain, but already done.  The long-range plan was to wait until the
 | http://v3vee.org/ project supported Intel's instructions and eventually | 
 | incorporate that. | 
 |  | 
 | All of these ways involve subtle and not-so-subtle difficulties.  The | 
 | Cell-as-OS mode will have to deal with them for the common case, which seems | 
 | brutal.  And rather unnecessary. |