kthreads.txt
Barret Rhoden

What Are They, And Why?
-------------------------------
Eventually a thread of execution in the kernel will want to block.  This means
that the thread is unable to make forward progress and something else ought to
run - the common case for this is when we wait on an IO operation.  This gets
trickier when a function does not know whether a function it calls will block
- sometimes it does, sometimes it doesn't.

The critical feature is not that we want to save the registers, but that we
want to preserve the stack and be able to use it independently of whatever
else we do on that core in the interim.  If we knew we would be done with and
return from whatever_else() before we needed to continue the current thread of
execution, we could simply call the function.  Instead, we want to be able to
run the old context independently of what else is running (which may be a
process).

We call this suspended context and the associated information a kthread,
managed by a struct kthread.  It's the bare minimum needed for the kernel to
stop and restart a thread of execution.  It holds the registers, stack
pointer, PC, struct proc* if applicable, stacktop, and little else.  There is
no silly_state / floating point state, or anything else.  Its address space is
determined by which process context (possibly none) was running.
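
For concreteness, here is a minimal sketch of the struct (field names are
illustrative - see kthread.h in the tree for the real thing; struct jmpbuf is
arch-specific there):

	#include <stdint.h>
	#include <sys/queue.h>

	struct jmpbuf {				/* arch-specific in reality */
		uintptr_t retaddr;		/* saved PC */
		uintptr_t stack_ptr;		/* saved SP */
		/* callee-saved registers, etc */
	};

	struct proc;				/* owned by the process code */

	struct kthread {
		struct jmpbuf context;		/* enough to setjmp/longjmp */
		uintptr_t stacktop;		/* top of this kthread's stack */
		struct proc *proc;		/* process context, or 0 */
		TAILQ_ENTRY(kthread) link;	/* e.g. on a semaphore's queue */
	};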

We also get a few other benefits, such as the ability to pick and choose which
kthreads to run where and when.  Users of kthreads should not assume that
core_id() stays the same across blocking calls.

We can also use this infrastructure in other cases where we might want to
start on a new stack.  One example is when we deal with low memory.  We may
have to do a lot of work, but only need to do a little to allow the original
thread (that might have failed on a page_alloc) to keep running, while the
memory freer also keeps running (now or later) from where it left off.  In
essence, we want to fork, work, and yield or run on another core.  The kthread
is just a means of suspending a call stack and a context for a little while.

Side Note:
-----------
Right now, blocking a kthread is an explicit action.  Some function realizes
it can't make progress (like waiting on a block device), so it sleeps on
something (for now a semaphore), and gets woken up when it receives its
signal.  This differs from processes, which can be stopped and suspended at
any moment (a page fault is the classic example).  In the future, we could
make kthreads preemptible (a timer interrupt goes off, and we choose to
suspend a kthread), but even then kthreads still have the ability to turn off
interrupts for tricky situations (like suspending the kthread).  The analog in
the process code is disabling notifications, which dramatically complicates
its functions (compare the save and pop functions for _ros_tf and _kernel_tf).
Furthermore, when a process disables notifications, it still doesn't mean it
is running without interruptions (it just looks that way to the vcore).  When
the kernel disables interrupts, it really is running without interruption.

What About Events?
-------------------------------
Why not just be event driven for all IO?  Why do we need these kernel threads?
In short, IO isn't as simple as "I just want a block and when it's done, run a
function."  While that is what the block device driver will do, the subsystems
actually needing the IO are much simpler if they are threaded.  Consider the
potentially numerous blocking IO calls involved in opening a file.  Having a
continuation for each one of those points in the call graph seems like a real
pain to code.  Perhaps I'm not seeing it, but if you're looking for a simple,
light mechanism for keeping track of what work you need to do, just use a
stack.  Programming is much simpler, and it costs a page plus a small data
structure.

Note that this doesn't mean that all IO needs to use kthreads, just that some
will really benefit from it.  I plan to make the "last part" of some IO calls
more event driven.  Basically, it's all just a toolbox, and you should use
what you need.

Freeing Stacks and Structs
-------------------------------
When we restart a kthread, we have to be careful about freeing the old stack
and the struct kthread.  We need to delay the freeing of both of these until
after we longjmp to the new kthread.  We can't free the kthread before popping
it, and we are on the stack we need to free (until we pop to the new stack).

To deal with this, we have a "spare" kthread per core, which gets assigned as
the spare when we restart a previous kthread.  When making/suspending a
kthread, we'll use this spare.  When restarting one, we'll free the old spare
if it exists and put ours there.  One drawback is that we potentially waste a
chunk of memory (1 page + a bit per core, worst case), but it is a nice,
simple solution.  Also, it will cut down on contention for free pages and the
kthread_kcache, though this won't help with serious contention issues (which
we'll deal with eventually).
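
Roughly, the restart path looks like this (a sketch - pcpui's fields and the
stack helpers are illustrative, not the exact code):

	void restart_kthread(struct kthread *kthread)
	{
		struct per_cpu_info *pcpui = &per_cpu_info[core_id()];

		disable_irq();	/* no traps on the stack we're about to swap */
		/* Can't free the struct or the stack we are running on yet, so
		 * stash both as the new spare, freeing the old spare first. */
		if (pcpui->spare) {
			put_kstack(pcpui->spare->stacktop);
			kmem_cache_free(kthread_kcache, pcpui->spare);
		}
		pcpui->spare = pcpui->cur_kthread;
		pcpui->spare->stacktop = get_stack_top();
		/* future traps come in on the restarted kthread's stack */
		set_stack_top(kthread->stacktop);
		pcpui->cur_kthread = kthread;
		longjmp(&kthread->context, 1);	/* never returns */
	}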

What To Run Next?
-------------------------------
When a kthread suspends, what do we run next?  And how do we know what to run
next?  For now, we call smp_idle() - it is what you do when you have nothing
else to do, or don't know what to do.  We could consider having sleep_on()
take a function pointer, but when we start hopping stacks, passing that info
gets tricky.  And we need to make the decision about which function to call
quickly, in the code (I don't trust the compiler much).  We can store the
function pointer at the bottom of the future stack and extract it from there.
Or we could put it in per_cpu_info.  Or we can send ourselves a routine kernel
message.

Regardless of where we put it, we ought to go through smp_idle() (or something
similar) to call it, since we need to make sure that whatever we run right
after jumping stacks never returns.  It's more flexible to allow a function
that returns for the func*, so we'll use smp_idle() as a level of indirection.
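
In code form, the post-jump landing pad might look like this (a sketch; the
func* stashing is hypothetical):

	static void __attribute__((noreturn)) __after_stack_jump(void)
	{
		/* If we stashed a func* (in pcpui, or at the bottom of this
		 * stack), we could call it here.  It is allowed to return;
		 * smp_idle() is what guarantees *we* never do. */
		smp_idle();
		assert(0);	/* smp_idle() never returns */
	}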

Semaphore Stuff
-------------------------------
We use the semaphore (defined in kthread.h) for kthreads to sleep on and wait
for a signal.  It is possible that the signal wins the race and beats the call
to sleep_on().  The semaphore handles this by "returning false."  You'll
notice that we don't actually call __down_sem(), but instead "build it in" to
sleep_on().  I didn't want to deal with returning a bool (even if it was an
inline), because I want to minimize the amount of stuff we do with potential
stack variables (I don't trust the register variable).  As soon as we unlock,
the kthread could be restarted (in theory), and it could start to clobber the
stack in later function calls.

So it is possible that we lose the semaphore race and shouldn't sleep.  We
unwind the sleep prep work.  An alternative was to only do the prep work if we
won the race, but that would mean we have to do a lot of work in that delicate
period of "I'm on the queue but it is unlocked" - work that requires touching
the stack.  Or we could just hold the lock for a longer period of time, which
I don't care to do.  What we do now is try to down the semaphore early (the
early bailout), and if that fails, do the prep work and then try to sleep
(unlocked).  If we then lose the race (unlikely), we manually unwind.

Note that a lot of this is probably needless worry - we have interrupts
disabled for most of sleep_on(), though arguably we can be a little more
careful with pcpui->spare and move the disable_irq() down to right before
setjmp().
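
A sketch of that flow (the semaphore is the one from kthread.h; the prep and
unwind helpers are hypothetical stand-ins for inline code):

	void sleep_on(struct semaphore *sem)
	{
		struct kthread *kthread;
		uintptr_t new_stacktop;

		disable_irq();
		/* Early bailout: if the signal already arrived, just take it */
		spin_lock(&sem->lock);
		if (sem->nr_signals > 0) {
			sem->nr_signals--;
			spin_unlock(&sem->lock);
			goto out;
		}
		spin_unlock(&sem->lock);
		/* Prep work: grab a kthread (the spare, if any), install a new
		 * default stacktop, and save our context */
		kthread = prep_kthread(&new_stacktop);
		if (setjmp(&kthread->context))
			goto out;	/* restarted via a later longjmp */
		spin_lock(&sem->lock);
		if (sem->nr_signals-- <= 0) {
			TAILQ_INSERT_TAIL(&sem->waiters, kthread, link);
			/* once we unlock, we can (in theory) be restarted, so
			 * minimize what touches this stack from here on */
			spin_unlock(&sem->lock);
			/* hop to the new stack, then smp_idle() */
			jump_stacks_and_idle(new_stacktop);
		}
		spin_unlock(&sem->lock);
		unwind_prep(kthread, new_stacktop);	/* lost the race */
	out:
		enable_irq();
	}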

What's the Deal with Stacks/Stacktops?
-------------------------------
When the kernel traps from userspace, it needs to know what to set the kernel
stack pointer to.  On x86, it looks in the TSS.  On riscv, we have a data
structure tracking that info (core_stacktops).  One thing I considered was
migrating the kernel from its boot stacks (on x86, just core 0; on riscv, all
the cores have one).  Instead, we just make sure the tables/TSS are up to date
right away (before interrupts or traps can come in for x86, and right away for
riscv).  These boot stacks aren't particularly special, just note they are in
the program data/bss sections and were never originally added to a free list.
But they can be freed later on.  This might be an issue in some places, but
those places ought to be fixed.

There are also some implications about PGSIZE stacks (specifically in the
asserts, how we alloc only one page, etc).  The bootstacks are bigger than a
page (for now), but in general we don't want to have giant stacks (and
shouldn't need them - note Linux runs with 4KB stacks).  In the future (long
range, when we're 64 bit), I'd like to put all kernel stacks high in the
address space, with guard pages after them.  This would require a certain
"quiet migration" to the new locations for the bootstacks (though not to a new
page - just a different virtual address for the stacks, not their page-alloced
KVA).  A bunch of minor things would need to change for that, so don't hold
your breath.

So what about stacktop?  It's just the top of the stack, but sometimes it is
the top of the stack we were on (when suspending the kthread), other times
kthread->stacktop is just a scrap page's top.

What's important when suspending is that the current stack is not used in
future traps - that it doesn't get clobbered.  That's why we need to find a
new stack and set it as the current stacktop.  We also need to 'save' the
stack page of the old kthread - we don't want it to be freed, since we need it
later.  When starting a kthread, I don't particularly care about which stack
is now the default stack.  sleep_on() assumes it was the kthread's, so unless
we always have a default one that is only used very briefly and never blocked
on (which would require a stack jump), we ought to just have a kthread run
with its stack as the default stacktop.

When restarting a kthread, we eventually will use its stack, instead of the
current one, but we can't free the current stack until after we actually
longjmp() to it.  This is the same problem as with the struct kthread dealloc.
So we can have the kthread (which we want to free later) hold on to the page
we wanted to dealloc.  Likewise, when we would need a fresh kthread, we also
need a page to use as the default stacktop.  So if we had a cached kthread, we
then use the page that kthread was pointing to.  NOTE: the spare kthread
struct is not holding the stack it was originally saved with.  Instead, it is
saving the page of the stack that was running when that kthread was
reactivated.  It's spare storage for both the struct and the page, but they
aren't linked in any meaningful way (i.e., that page is not the spare's own
stack).  That linkage is only true when a kthread is in use (like in a
semaphore queue).
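
The suspend-side bookkeeping from that discussion might look like this (a
sketch, using the hypothetical get_kstack()/get_stack_top()/set_stack_top()
helpers from earlier):

	/* Grab a kthread struct and a page for the new default stacktop,
	 * preferring the per-core spare (whose page is just scrap storage). */
	static struct kthread *grab_kthread(struct per_cpu_info *pcpui,
	                                    uintptr_t *new_stacktop)
	{
		struct kthread *kthread;

		if (pcpui->spare) {
			kthread = pcpui->spare;
			*new_stacktop = kthread->stacktop;
			pcpui->spare = 0;
		} else {
			kthread = kmem_cache_alloc(kthread_kcache, 0);
			*new_stacktop = get_kstack();	/* fresh page */
		}
		/* the old stack now belongs to the suspended kthread ... */
		kthread->stacktop = get_stack_top();
		/* ... and future traps come in on the new page */
		set_stack_top(*new_stacktop);
		return kthread;
	}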

Current and Process Contexts
-------------------------------
When a kthread is suspended, should the core stay in process context (if it
was before)?  Short answer: yes.

For vcore local calls (process context, trapped on the calling core), we're
giving the core back, so we can avoid TLB shootdowns.  Though we do have to
incref (which writes a cache line in the proc struct), since we are storing a
reference to the proc (and will try to load its cr3 later).  While this sucks,
keep in mind this is for a blocking IO call (where we couldn't find the page
in any cache, etc).  It might be a scalability bottleneck, but it also might
not matter in any real case.
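
That bookkeeping is just a couple lines in the suspend path (a sketch;
proc_incref()'s exact signature may differ):

	/* keep the proc alive across the block, so we can load its cr3 on
	 * restart - this is the cache-line write mentioned above */
	kthread->proc = current;
	if (kthread->proc)
		proc_incref(kthread->proc, 1);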

For async calls, it is less clear.  We might want to keep processing that
process's syscalls, so it'd be easier to keep its cr3 loaded.  Though it's not
as clear how we get from smp_idle() to a workable function, or whether it is
useful to be in process context until we start processing those functions
again.  Keep in mind that normally, smp_idle() shouldn't be in any process's
context.  I'll probably write something later that abandons any context before
halting to make sure processes die appropriately.  But there are still some
unresolved issues that depend on what exactly we want to do.

While it is tempting to say that we stay in process context if the call was
local, but not if it was async, there is an added complication.  The function
calling sleep_on() doesn't care about whether it is on a process-allocated
core or not.  This is solvable by using per_cpu_info(), and will probably work
its way into a future patch, regardless of whether or not we stay in process
context for async calls.

As a final case, what will we do for processes that were interrupted by
something that wants to block, but wasn't servicing a syscall?  We probably
shouldn't have these (I don't have a good example of when we'd want one, and a
bunch of reasons why we wouldn't), but if we do, then it might be okay anyway
- the kthread is just holding that proc alive for a bit.  Page faults are a
bit different - they are something the process wants, at least.  I was
thinking more about unrelated async events.  Still, it shouldn't be a big
deal.

Kmsgs and Kthreads
-------------------------------
Is there a way to mix kernel messages and kthreads?  What's the difference,
and can one do the other?  A kthread is a suspended call-stack and context
(thread), stopped in the middle of its work.  Kernel messages are about
starting fresh - "hey core X, run this function."  A kmsg can very easily be a
tool used to restart a kthread (either locally or on another core).  We do
this in the test code, if you're curious how it could work.

Note we use the semaphore to deal with races.  In test_kthreads(), we're
actually using the kmsg to up the semaphore.  You could just as easily up the
semaphore on one core (possibly in response to a kmsg, though more likely due
to an interrupt), and then send the kthread to another core to restart via a
kmsg.
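
A sketch of that split (send_kernel_message() and KMSG_ROUTINE are the real
kmsg interface; the handler body and __up_sem()'s return value are
illustrative):

	/* routine kmsg handler: runs fresh on dst_core */
	static void __launch_kthread(uint32_t srcid, long a0, long a1, long a2)
	{
		struct kthread *kthread = (struct kthread*)a0;
		/* would need to deal with whatever process state is on this
		 * core (see below) */
		restart_kthread(kthread);	/* never returns */
	}

	/* e.g. in an interrupt handler that just learned the IO completed;
	 * assume __up_sem() returns a waiter, if there was one */
	kthread = __up_sem(&sem);
	if (kthread)
		send_kernel_message(dst_core, __launch_kthread, (long)kthread,
		                    0, 0, KMSG_ROUTINE);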

There's no reason you can't separate the __up_sem() and the running of the
kthread - the semaphore just protects you from missing the signal.  Perhaps
you'll want to rerun the kthread on the physical core it was suspended on
(cache locality, and it might be a legit option to allow processes to say it's
okay to take their vcore).  Note this may require more bookkeeping in the
struct kthread.

There is another complication: the way we've been talking about kmsgs
(starting fresh), we are talking about *routine* messages.  One requirement
for routine messages that do not return is that they handle process state.
The current kmsgs, such as __death and __preempt, are built so they can handle
acting on whichever process is currently running.  Likewise,
__launch_kthread() needs to handle the cases that arise when it runs on a core
that was about to run a process (as can often happen with proc_restartcore(),
which calls process_routine_kmsg()).  Basically, if it was a _S, it just
yields the process, similar to what happens in Linux (call schedule() on the
way out, from what I recall).  If it was a _M, things are a bit more
complicated, since this should only happen if the kthread is for that process
(and probably a bunch of other things - like they said it was okay to
interrupt their vcore to finish the syscall).  Note - this might not be
accurate anymore (see the discussions on current_ctx).

To a certain extent, routine kmsgs don't seem like a nice fit, when we really
want to be calling schedule().  Though if you think of it as the enactment of
a previous scheduling decision (like other kmsgs, e.g. __death()), then it
makes more sense.  The scheduling decision (as of now) was made in the
interrupt handler when it decided to send the kernel msg.  In the future, we
could split this into having the handler make the kthread active, and having
the scheduler decide where and when to run the kthread.

Current_ctx, Returning Twice, and Blocking
--------------------------------
One of the reasons for decoupling kthreads from a vcore or the notion of a
kernel thread per user process/task is so that when the kernel blocks (on a
syscall or wherever), it can return to the process.  This is the essence of
the asynchronous kernel/syscall interface (though it's not limited to syscalls
- page faults!).  Here is what we want it to be able to handle:
- When a process traps (syscall, trap, or actual interrupt), the process
regains control when the kernel is done or when it blocks.
- Any kernel path can block at any time.
- Kernel control paths need to not "return twice", but we don't want to have
to go through acrobatics in the code to prevent this.

There are a couple of approaches I considered, involving the nature of
"current_ctx" and a brutal bug.  Current_ctx (formerly current_tf) is a
pointer to the trapframe of the process that was interrupted/trapped, and is
the user context that ought to be running on this core if we return.
Current_ctx is 'made' when the kernel saves the context at the top of the
interrupt stack (aka 'stacktop').  Then the kernel's call path proceeds down
the same stack.  This call path may get blocked in a kthread.  When we block,
we want to restart the current_ctx.  There is a coupling between the kthread's
stack and the storage of current_ctx (the contents, not the pointer, which is
in pcpui).

This coupling presents a problem when we are in userspace and get interrupted,
and that interrupt wants to restart a kthread.  In this case, current_ctx
points to the interrupt stack, but then we want to switch to the kthread's
stack.  This is okay.  When that kthread wants to block again, it needs to
switch back to another stack.  Up until this commit, it was jumping to the top
of the old stack it was on, clobbering current_ctx (took about 8-10 hours to
figure this out).  While we could just make sure to save space for
current_ctx, that doesn't solve the problem: namely that the current_ctx
concept is not bound to a specific kernel stack (kthread or otherwise).  We
could have cases where more than one kthread starts up on a core and we end up
freeing the page that holds current_ctx (since it is a stack we no longer
need).  We don't want to bother keeping stacks around just to hold the
current_ctx.  Part of the nature of this weird coupling is that a given
kthread might or might not have the current_ctx at the top of its stack.  What
a pain in the ass...

The right answer is to decouple current_ctx from kthread stacks.  There are
two ways to do this.  In both, current_ctx retains its role as the context the
kernel restarts (or saves) when it goes back to a process, and is independent
of blocking kthreads.  SPOILER: solution 1 is not the one I picked.

1) All traps/interrupts come in on one stack per core.  That stack never
changes (regardless of blocking), and current_ctx is stored at the top.
Kthreads sort of 'dispatch' / turn into threads from this event-like handling
code.  This actually sounds really cool!

2) The contents of current_ctx get stored in per-cpu-info (pcpui), thereby
clearly decoupling it from any execution context.  Any thread of execution can
block without any special treatment (though interrupt handlers shouldn't do
this).  We handle the "returning twice" problem at the point of return.

One nice thing about 1) is that it might make stack management easier (we
wouldn't need to keep a spare page, since it's the default core stack).  2) is
also tricky, since we need to change some entry point code to write the TF to
pcpui (or at least copy it out for now).

The main problem with 1) is that you need to know and have code to handle when
you "become" a kthread and are allowed to block.  It also prevents us from
making changes such that all executing contexts are kthreads (which sort of is
what is going on, even if they don't have a struct yet).
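
For reference, option 2's decoupling amounts to something like this (a
sketch; field names are illustrative):

	struct per_cpu_info {
		struct user_context actual_ctx;	/* storage for the user TF */
		struct user_context *cur_ctx;	/* set iff a user ctx is saved */
		struct kthread *spare;
		/* ... */
	};

	/* at the trap/irq entry points, when coming from userspace: */
	static void copy_out_current_ctx(struct per_cpu_info *pcpui,
	                                 struct user_context *entry_ctx)
	{
		pcpui->actual_ctx = *entry_ctx;	/* the copy-out */
		pcpui->cur_ctx = &pcpui->actual_ctx;
	}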

While considering 1), here's something I wanted to say: "every thread of
execution, including a KMSG, needs to always return (and thus not block), or
never return (and be allowed to block)."  To "become" a kthread, we'd need to
have code that jumps stacks, and once it jumps it can never return.  It would
have to go back to some place such as smp_idle().

Jumping stacks isn't a problem, and whatever we jump to would just have to
have smp_idle() at the end.  The problem is that this is a pain in the ass to
work with in reality.  But wait!  Don't we do that with batched syscalls right
now?  Yes (though we should be using kmsgs instead of the hacked-together
workqueue spread across smp_idle() and syscall.c), and it is a pain in the
ass.  It is doable with syscalls because we have a clearly defined point
(submitting vs processing).  But what about other handlers, such as the page
fault handler?  It could block, and lots of other handlers could block too.
All of those would need to have a jump point (in trap.c).  We aren't even
handling events anymore, we are immediately jumping to other stacks, using our
"event handler" to hold current_ctx and handle how we return to current_ctx.
Don't forget about other code (like the boot code) that wants to block.
Simply put, option 1 creates a layer that is a pain to work with, cuts down on
the flexibility of the kernel to block when it wants, and doesn't handle the
issue at its source.

The issue with having a defined point in the code that you can't return back
across (which is where 1 would jump stacks) is about "returning twice."
Imagine a syscall that doesn't block.  It traps into the kernel, does its
work, then returns.  Now imagine a syscall that blocks.  Most of these calls
are going to block on occasion, but not always (imagine the read was filled
from the page cache).  These calls really need to handle both situations.  So
in one instance, the call blocks.  Since we're async, we return to userspace
early (pop the current_ctx).  Now, when that kthread unblocks, its code is
going to want to finish and unroll its stack, then pop back to userspace.
This is the 'returning twice' problem.  Note that a *kthread* never returns
twice.  This is what makes the idea of magic jumping points we can't return
back across (and tying that to how we block in the kernel) painful.

The way I initially dealt with this was by always calling smp_idle(), and
having smp_idle() decide what to do.  I also used it as a place to dispatch
batched syscalls, which is what made smp_idle() more attractive.  However,
after a bit, I realized the real nature of returning twice: current_ctx.  If
we forget about the batching for a second, all we really need to do is not
return twice.  The best place to do that is at the place where we consider
returning to userspace: proc_restartcore().  Instead of calling smp_idle() all
the time (which was in essence a "you can now block" point), and checking for
a current_ctx to return to, just check in restartcore to see if there is a tf
to restart.  If there isn't, then we smp_idle().  And don't forget to handle
the cases where we want to start an scp_ctx (which we ought to point
current_ctx to in proc_run()).
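
The check at the point of return might look like this (a sketch; the real
proc_restartcore() has more to deal with):

	void proc_restartcore(void)
	{
		struct per_cpu_info *pcpui = &per_cpu_info[core_id()];

		disable_irq();
		/* routine kmsgs may launch kthreads, finish syscalls, etc */
		process_routine_kmsg();
		if (!pcpui->cur_ctx)
			smp_idle();	/* nothing to pop; never returns */
		/* pop the user context - the one and only "return" */
		proc_pop_ctx(pcpui->cur_ctx);
	}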

As a side note, we ought to use kmsgs for batched syscalls - it will help with
preemption latencies.  At least for all but the first syscall (which can be
called directly).  Instead of sending a retval via current_ctx about how many
started, just put that info in the syscall struct's flags (which might help
the remote syscall case - no need for a response message, though there are
still a few differences, e.g. no failure model other than death).

Note that page faults will still be tricky, but at least now we don't have to
worry about points of no return.  We just check if there is a current_ctx to
restart.  The tricky part is communicating that the PF was sorted when there
wasn't an explicit syscall made.


Aborting Syscalls (2013-11-22)
-------------------------------
On occasion, userspace would like to abort a syscall, specifically ones that
are listening on sockets/conversations where no one will ever answer.

We have limited support for aborting syscalls.  Kthreads that are in
rendez_sleep() (common for anything in the 9ns chunk of the kernel, which
includes any conversation listens) can be aborted.  They'll return with an
error string to userspace.

The easier part is the rules for kernel code to be abortable:
- Restore your invariants with waserror() before calling rendez_sleep().
- That's really it.
So if you're holding a qlock, put your qunlock() code and any other unwinding
(such as a kfree()) in a waserror() catch, as sketched below.  As it happens,
it looks like plan9 already did that (at least for the rendez in listen).
And, as always, you can't hold a spinlock when blocking, regardless of
aborting calls or anything.
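
A sketch of that pattern, with a hypothetical qlock'd structure and buffer
(waserror()/nexterror()/poperror() are the usual plan9-style error handling):

	qlock(&conv->qlock);
	buf = kmalloc(len, KMALLOC_WAIT);
	if (waserror()) {
		/* an abort comes through here as an error */
		kfree(buf);			/* undo anything we did... */
		qunlock(&conv->qlock);		/* ...and restore invariants */
		nexterror();			/* keep unwinding */
	}
	rendez_sleep(&conv->rendez, cond_f, conv);	/* abortable sleep */
	poperror();
	qunlock(&conv->qlock);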

I don't want arbitrary sleeps to be abortable.  For instance, if a kthread is
waiting on an arbitrary semaphore/qlock, we won't allow an abort.  The
reasoning is that the kthread will eventually acquire the qlock - we're not
waiting on external sources to wake up.  That's not 100% true - a kthread
could be blocked on a qlock, and the qlock holder could itself be abortable.
In the future, we could build some sort of "abort inheritance", usable by root
or something (there's danger in aborting another process's kthread).
Alternatively, we could make qlocks abortable too, though that would require
all qlocking code to be unwindable.

The harder part of syscall aborting is safely waking a kthread.  There are
several layers to go through from uthread or syscall down to the condition
variable a kthread is sleeping on.  Given a uthread, find its syscall.  Given
a syscall, find its kthread.  Given the kthread, find the CV.  And during all
of these, syscalls complete concurrently, kthreads get repurposed for other
syscalls, and CVs could be freed (though that doesn't happen).  Syscalls are
often on stacks, so when they complete, the memory is both gibberish and
potentially in use.

Ultimately, I decided on a system of "safe abort attempts", where it is
harmless to be wrong with an attempt.  Instead of dealing with the races
associated with memory freeing and syscalls completing, an abort only proceeds
when it is safe to proceed (using a lookup via pointer, and only dereferencing
if the lookup succeeds).

As it stands now, all abortable kthreads/sleepers/syscalls are on a per-proc
list, and we can look them up by struct syscall*.  They are only on the list
when they are abortable (the CV can be poked), and the invariant is that when
they are on the list, they are in a state that can be safely aborted: the
kthread is working on the syscall, it hasn't unwound, it is still in
rendez_sleep(), the CV is safe, etc.  The details of this protection are
sorted out with __reg_abortable_cv() and dereg_abortable_cv() (since it's
really the condition variable that we're trying to find).  So from the kernel
side, nothing bad can happen if you ask to abort an arbitrary struct
syscall*.

The actual abort uses the "write/signal, then wake" method.  The aborter
tracks down the kthread via the lookup, the success of which guarantees the
sleeper is in rendez_sleep() (or similar sleep paths), sets SC_ABORT (with
barriers), and attempts to wake the kthread (cv_broadcast, since we need to be
sure we woke the right kthread).
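
A sketch of the aborter's side (the lookup/release helpers are hypothetical
stand-ins for the cv_lookup_elm handling in the real code):

	int abort_sysc(struct proc *p, struct syscall *sysc)
	{
		struct cv_lookup_elm *cle;

		/* only dereference if the lookup succeeds; a failed lookup
		 * just means the syscall isn't abortable right now */
		cle = __lookup_abortable(p, sysc);
		if (!cle)
			return -1;	/* caller can retry later */
		atomic_or(&sysc->flags, SC_ABORT);
		wmb();	/* flag must be visible before we wake the sleeper */
		cv_broadcast(cle->cv);	/* broadcast: wake *our* kthread */
		__release_cle(cle);	/* lets dereg_abortable_cv() finish */
		return 0;
	}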

On the user side, we set an alarm to run an event handler that will cancel our
syscall.  The alarm stuff is fairly standard (runs in vcore context).
Userspace no longer has to worry about syscalls completing while it aborts
them, since the kernel will only abort syscalls that are abortable.  However,
it may have issues (in theory) with aborting future syscalls.  If the alarm
goes off when the uthread is in another, later syscall (which may happen to
use the same struct syscall*), then we could accidentally abort the wrong
call.  There's an aspect of time associated with the first abort alarm
handler.  This is relatively easy to handle: just turn off the alarm before
reusing that syscall struct for a syscall.  This relies on a property of the
alarms: that when deregistering completes, the alarm handler will not be
running concurrently.  Incidentally, there is *another* minor trick here: when
adjusting the alarm, the uthread will issue a syscall, possibly reusing its
old sysc*, but that happens *after* deregistering its original alarm - the
point at which we could have potentially accidentally cancelled an arbitrary
syscall.  Also note that the call to change the kernel alarm wouldn't actually
block and become abortable, but regardless, we're safe.

There are a couple downsides to the "safe abort attempts" approach.  We can
only abort syscalls when they are at a certain point - if they aren't
currently sleeping, the call will fail.  Technically, the abort could take
effect later on in the life of a syscall (the aborter flags the kthread to
abort concurrently with the kthread waking up naturally, and then the call
aborts on the next rendez_sleep that needs to block).  Related to this
limitation, userspace must keep attempting to cancel a syscall until it
succeeds.  It may also be told an abort succeeded, even if the call actually
completes (the aborter flags the kthread, the rendez wakes naturally, and the
kthread never blocks again).  Ultimately, we can't "fire and forget" our abort
attempt.  It's not a huge problem though: my older approaches avoided this
particular issue, but had worse problems of their own.

For instance, the original idea I had was for userspace to flag the syscall
(flags |= SC_ABORT).  It could do this at any time.  Whenever the kthread was
going to block in an abortable location (e.g. rendez_sleep()), it would see
the flag and abort.  It might already be asleep, so we would also provide a
syscall that would 'kick' the kthread responsible for some sysc*, to wake it
up to see the flag and abort.  The first problem was writing to the sysc
flags.  Unless we know the memory is actually the syscall we want, this could
result in randomly writing to memory (such as a uthread's stack).  I ran into
similar issues in the kernel: you can't touch a kthread struct unless you know
it is the kthread you want.

Once I started dealing with the syscall -> kthread mapping, it became clear
I'd need a per-proc lookup service in the kernel, which acts as a way to lock
a reference to the kthread.  I could solve the 'kthread memory safety' problem
by looking up by reference, similar to how pid2proc works.  Analogously, by
changing the interface for sys_abort_syscall() to be a "lookup" approach, I
solve the struct syscall * memory problem.

As a smaller note, I considered registering every kthread with the process
right away (specifically, when we link the syscall to the kthread->sysc) for
the sysc->kthread lookup service.  But this would get expensive, since every
syscall pays the lookup tax (and we'd need to worry about scaling).  We want
syscalls to be fast, but the infrequent aborts can be expensive.  The obvious
change was to only save the abortable kthreads.  The tradeoff is that we can't
flag syscalls for aborting unless they are in an abortable state.  This
requires multiple pokes by userspace.  In theory, userspace would have to deal
with that scenario anyway (in case it attempts to abort before we even
register in the first place).

As another side note, if userspace ever has a struct syscall allocator, for
use in async (non-uthread stack-based) syscalls, we'll need to not reuse a
syscall struct until after the cancel alarm has been disarmed.  Right now we
do this by not having the uthread issue another syscall until after the
disarm, since uthread stack-based syscalls are inherently bound to the
uthread.  A simple solution would be to have a per-uthread syscall struct,
which that uthread uses preferentially, and the sysc is only freed when the
uthread is freed.  Not only would this scale better than accessing the sysc
allocator for every syscall, but there is also no worry of reuse until the
uthread disarms and exits.

It is a userspace bug for a uthread to set the alarm and not unset it before
either making a syscall or exiting.  The root issue of that potential bug is
that someone (the alarm handler) holds a pointer to a uthread, with the intent
of cancelling its syscall, and we need to somehow take back that pointer
(cancel the alarm) before reusing the syscall or freeing the uthread.  I
considered not making the alarm guarantee that when the cancel returns, the
handler isn't running concurrently.  We could handle the races in the alarm
handler and in the cancel code, but it's an added hassle that isn't clearly
needed.  This does mean we have to run the alarm handlers serially, while
holding the alarm lock.  I'm fine with this, for now.  Perhaps if users want
more concurrency, their handlers can spawn or wake up a uthread.

It is also worth noting that many rendez_sleep() calls actually return right
away.  This is common if some data is already in the queue (or whatever the
condition is that we want to conditionally sleep on).  Since registration is a
little bit heavier than just locking the CV, I use the classic "check, signal,
check again" style, where we check cond, then register, and then check cond
for real.  The initial check is the optimization, while the "signal, then
check" is the true synchronization.  I use this style all over the place
(check out the event delivery with concurrent vcore yields code).
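
In rendez_sleep() terms, the pattern looks something like this (a sketch -
the cond/arg and cle handling follow the text, not the exact code):

	if (cond(arg))
		return;			/* optimization: often already true */
	cv_lock(&rv->cv);
	/* registration happens with the CV lock held (see below) */
	__reg_abortable_cv(&cle, &rv->cv);
	while (!cond(arg)) {		/* the real check */
		/* an abort wakes us and errors out of here via the waserror()
		 * machinery; a natural wakeup just rechecks cond */
		cv_wait(&rv->cv);
	}
	cv_unlock(&rv->cv);
	dereg_abortable_cv(&cle);	/* called without the CV lock */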

Because of this optimization, we have a slightly odd interface: __reg is
called with the CV lock held, and dereg_ is not.  There are some lock ordering
issues.  Without the optimization, we could simply make the order {list lock,
CV lock}, so that the aborter can use the list lock to keep a kthread/cv alive
(one of the struct cv_lookup_elm in the code, to be precise) while it
cv_broadcasts.  However, the "check first" optimization would need to lock and
unlock the CV a couple times, which seems excessive.  So we switch the lock
order to {CV, list lock}, and the aborter doesn't hold the list lock while
signalling the CV.  Instead, it keeps the cle alive with a flag that dereg_
spins on.  This spinwait is why dereg can't hold the CV lock: it would create
a circular dependency.