| kthreads.txt |
| Barret Rhoden |
| |
| What Are They, And Why? |
| ------------------------------- |
| Eventually a thread of execution in the kernel will want to block. This means |
| that the thread is unable to make forward progress and something else ought |
| to run - the common case is waiting on an IO operation. This gets trickier |
| when a function does not know whether the functions it calls will block: |
| sometimes they do, sometimes they don't. |
| |
| The critical feature is not that we want to save the registers, but that we want |
| to preserve the stack and be able to use it independently of whatever else we do |
| on that core in the interim time. If we knew we would be done with and return |
| from whatever_else() before we needed to continue the current thread of |
| execution, we could simply call the function. Instead, we want to be able to |
| run the old context independently of what else is running (which may be a |
| process). |
| |
| We call this suspended context and the associated information a kthread, managed |
| by a struct kthread. It's the bare minimum needed for the kernel to stop and |
| restart a thread of execution. It holds the registers, stack pointer, PC, |
| struct proc* if applicable, stacktop, and little else. There is no silly_state |
| / floating point state, or anything else. Its address space is determined by |
| which process context (possibly none) was running. |
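| |
| For reference, the struct looks something like this - a sketch with |
| illustrative field names (see kthread.h for the real definition): |
| |
|     struct kthread { |
|         struct kernel_ctx context;      /* saved registers, SP, and PC */ |
|         uintptr_t stacktop;             /* top of this kthread's stack */ |
|         struct proc *proc;              /* process context, if applicable */ |
|         struct syscall *sysc;           /* syscall we're servicing, if any */ |
|         TAILQ_ENTRY(kthread) link;      /* e.g. a semaphore's wait queue */ |
|     }; |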
| |
| We also get a few other benefits, such as the ability to pick and choose which |
| kthreads to run where and when. Users of kthreads should not assume that the |
| core_id() stayed the same across blocking calls. |
| |
| We can also use this infrastructure in other cases where we might want to start |
| on a new stack. One example is when we deal with low memory. We may have a |
| lot of work to do, but only need to do a little of it before the original |
| thread (which might have failed on a page_alloc) can keep running, while the |
| memory freer continues (now or later) from where it left off. In essence, we |
| want to fork, work, and yield or run on another core. The kthread is just a |
| means of suspending a call stack and a context for a little while. |
| |
| Side Note: |
| ----------- |
| Right now, blocking a kthread is an explicit action. Some function realizes it |
| can't make progress (like waiting on a block device), so it sleeps on something |
| (for now a semaphore), and gets woken up when it receives its signal. This |
| differs from processes, which can be stopped and suspended at any moment |
| (pagefault is the classic example). In the future, we could make kthreads |
| preemptable (a timer interrupt fires, and we choose to suspend a kthread), but |
| even then kthreads still have the ability to turn off interrupts for tricky |
| situations (like suspending the kthread). The analog in the process code is |
| disabling notifications, which dramatically complicates its functions (compare |
| the save and pop functions for _ros_tf and _kernel_tf). Furthermore, when a |
| process disables notifications, it still doesn't mean it is running without |
| interruptions (it looks like that to the vcore). When the kernel disables |
| interrupts, it really is running. |
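| |
| As a concrete (and hypothetical) example of the explicit style, a block device |
| path might look like this - disk_sem and handle_disk_irq are made up: |
| |
|     static struct semaphore disk_sem;   /* assume it starts with 0 signals */ |
| |
|     /* Some kernel path realizes it can't make progress: */ |
|     void wait_for_disk(void) |
|     { |
|         sleep_on(&disk_sem);    /* suspend this kthread; the core moves on */ |
|         /* by the time we return, the kthread was restarted: the IO is done */ |
|     } |
| |
|     /* The interrupt handler delivers the signal: */ |
|     void handle_disk_irq(void) |
|     { |
|         __up_sem(&disk_sem);    /* wakes a sleeping kthread, if there is one */ |
|     } |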
| |
| What About Events? |
| ------------------------------- |
| Why not just be event driven for all IO? Why do we need these kernel threads? |
| In short, IO isn't as simple as "I just want a block and when it's done, run a |
| function." While that is what the block device driver will do, the subsystems |
| actually needing the IO are much simpler if they are threaded. Consider the |
| potentially numerous blocking IO calls involved in opening a file. Having a |
| continuation for each one of those points in the call graph seems like a real |
| pain to code. Perhaps I'm not seeing it, but if you're looking for a simple, |
| light mechanism for keeping track of what work you need to do, just use a stack. |
| Programming is much simpler, and it costs a page plus a small data structure. |
| |
| Note that this doesn't mean that all IO needs to use kthreads, just that some |
| will really benefit from it. I plan to make the "last part" of some IO calls |
| more event driven. Basically, it's all just a toolbox, and you should use what |
| you need. |
| |
| Freeing Stacks and Structs |
| ------------------------------- |
| When we restart a kthread, we have to be careful about freeing the old stack and |
| the struct kthread. We need to delay the freeing of both of these until after |
| we longjmp to the new kthread. We can't free the kthread before popping it, |
| and we are on the stack we need to free (until we pop to the new stack). |
| |
| To deal with this, we have a "spare" kthread per core, which gets assigned as |
| the spare when we restart a previous kthread. When making/suspending a kthread, |
| we'll use this spare. When restarting one, we'll free the old spare if it |
| exists and put ours there. One drawback is that we potentially waste a chunk of |
| memory (1 page + a bit per core, worst case), but it is a nice, simple solution. |
| Also, it will cut down on contention for free pages and the kthread_kcache, |
| though this won't help with serious contention issues (which we'll deal with |
| eventually). |
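| |
| In code, the swap at restart time looks roughly like this (a sketch of |
| restart_kthread(); the real version lives in kthread.c): |
| |
|     void restart_kthread(struct kthread *kthread) |
|     { |
|         struct per_cpu_info *pcpui = &per_cpu_info[core_id()]; |
|         uintptr_t current_stacktop = get_stack_top(); |
| |
|         /* Free the old spare's stack page and struct, if we have one. |
|          * Assumes stacks are one page, with stacktop at the page's end. */ |
|         if (pcpui->spare) { |
|             page_decref(kva2page((void*)pcpui->spare->stacktop - PGSIZE)); |
|             kmem_cache_free(kthread_kcache, pcpui->spare); |
|         } |
|         /* Future traps should come in on the restarted kthread's stack. */ |
|         set_stack_top(kthread->stacktop); |
|         /* The restarted kthread becomes the spare, holding on to the stack |
|          * we are still running on - neither can be freed until after we pop |
|          * off of this stack. */ |
|         kthread->stacktop = current_stacktop; |
|         pcpui->spare = kthread; |
|         pop_kernel_ctx(&kthread->context);      /* longjmp; never returns */ |
|     } |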
| |
| What To Run Next? |
| ------------------------------- |
| When a kthread suspends, what do we run next? And how do we know what to run |
| next? For now, we call smp_idle() - it is what you do when you have nothing |
| else to do, or don't know what to do. We could consider having sleep_on() take |
| a function pointer, but when we start hopping stacks, passing that info gets |
| tricky. And we need to make the decision about which function to call quickly, |
| in code (I don't trust the compiler much). We can store the function pointer |
| at the bottom of the future stack and extract it from there. Or we could put it |
| in per_cpu_info. Or we can send ourselves a routine kernel message. |
| |
| Regardless of where we put it, we ought to call smp_idle() (or something |
| similar) before calling it, since whatever we call right after jumping stacks |
| must never return. It's more flexible to allow the stored function pointer to |
| be one that returns, so we'll use smp_idle() as a level of indirection. |
| |
| Semaphore Stuff |
| ------------------------------- |
| We use the semaphore (defined in kthread.h) for kthreads to sleep on and wait |
| for a signal. It is possible that the signal wins the race and beats the call |
| to sleep_on(). The semaphore handles this by "returning false." You'll notice |
| that we don't actually call __down_sem(), but instead "build it in" to |
| sleep_on(). I didn't want to deal with returning a bool (even if it was an |
| inline), because I want to minimize the amount of stuff we do with potential |
| stack variables (I don't trust the register variable). As soon as we unlock, |
| the kthread could be restarted (in theory), and it could start to clobber the |
| stack in later function calls. |
| |
| So it is possible that we lose the semaphore race and shouldn't sleep. We |
| unwind the sleep prep work. An alternative was to only do the prep work if we |
| won the race, but that would mean we have to do a lot of work in that delicate |
| period of "I'm on the queue but it is unlocked" - work that requires touching |
| the stack. Or we could just hold the lock for a longer period of time, which |
| I don't care to do. What we do now is try to down the semaphore early (the |
| early bailout), and if that fails, do the prep work and try to sleep |
| (unlocked). If we then lose the race (unlikely), we manually unwind. |
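| |
| Putting that together, the shape of sleep_on() is roughly the following. This |
| is heavily simplified: the helpers marked hypothetical don't exist by these |
| names, and the real code is far more careful about stack variables and races: |
| |
|     void sleep_on(struct semaphore *sem) |
|     { |
|         volatile bool blocking = TRUE;  /* spots the second "return" */ |
|         struct kthread *kthread; |
| |
|         /* Early bailout: if a signal already arrived, skip the prep. */ |
|         spin_lock_irqsave(&sem->lock); |
|         if (sem->nr_signals > 0) { |
|             sem->nr_signals--; |
|             spin_unlock_irqsave(&sem->lock); |
|             return; |
|         } |
|         spin_unlock_irqsave(&sem->lock); |
|         /* Sleep prep: grab the spare (or alloc) and save our context. */ |
|         kthread = get_spare_or_alloc();         /* hypothetical helper */ |
|         kthread->stacktop = get_stack_top(); |
|         kthread->proc = current;                /* real code also increfs */ |
|         save_kernel_ctx(&kthread->context); |
|         if (!blocking) |
|             return;         /* second return: we were restarted */ |
|         blocking = FALSE;   /* so the restart path short-circuits above */ |
|         spin_lock_irqsave(&sem->lock); |
|         if (sem->nr_signals > 0) { |
|             /* Lost the race after all (unlikely): manually unwind. */ |
|             sem->nr_signals--; |
|             spin_unlock_irqsave(&sem->lock); |
|             unwind_sleep_prep(kthread);         /* hypothetical helper */ |
|             return; |
|         } |
|         TAILQ_INSERT_TAIL(&sem->waiters, kthread, link); |
|         spin_unlock_irqsave(&sem->lock); |
|         /* Jump to a scrap stack, then smp_idle() - never returns here. */ |
|     } |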
| |
| Note that a lot of this is probably needless worry - we have interrupts disabled |
| for most of sleep_on(), though arguably we can be a little more careful with |
| pcpui->spare and move the disable_irq() down to right before setjmp(). |
| |
| What's the Deal with Stacks/Stacktops? |
| ------------------------------- |
| When the kernel traps from userspace, it needs to know what to set the kernel |
| stack pointer to. In x86, it looks in the TSS. In riscv, we have a data |
| structure tracking that info (core_stacktops). One thing I considered was |
| migrating the kernel off its boot stacks (x86: just core0; riscv: every core |
| has one). Instead, we just make sure the tables/TSS are up to date right away |
| (before interrupts or traps can come in for x86, and right away for riscv). |
| These boot stacks aren't particularly special, just note they are in the program |
| data/bss sections and were never originally added to a free list. But they can |
| be freed later on. This might be an issue in some places, but those places |
| ought to be fixed. |
| |
| There are also some implications of PGSIZE stacks (specifically in the |
| asserts, how we alloc only one page, etc). The bootstacks are bigger than a |
| page (for now), but in general we don't want to have giant stacks (and shouldn't |
| need them - note linux runs with 4KB stacks). In the future (long range, when |
| we're 64 bit), I'd like to put all kernel stacks high in the address space, with |
| guard pages after them. This would require a certain "quiet migration" of the |
| bootstacks to the new locations (though not a new page - just a different |
| virtual address for the stacks, not their page-alloced KVA). A bunch of minor |
| things would need to change for that, so don't hold your breath. |
| |
| So what about stacktop? It's just the top of the stack, but sometimes it is the |
| stack we were on (when suspending the kthread), other times kthread->stacktop |
| is just a scrap page's top. |
| |
| What's important when suspending is that the current stack is not |
| used in future traps - that it doesn't get clobbered. That's why we need to |
| find a new stack and set it as the current stacktop. We also need to 'save' |
| the stack page of the old kthread - we don't want it to be freed, since we |
| need it later. When starting a kthread, I don't particularly care about which |
| stack is now the default stack. sleep_on() assumes it was the kthread's, so |
| unless we always have a default stack that is only used very briefly and |
| never blocked on (which would require a stack jump), we ought to just have a |
| kthread run with its stack as the default stacktop. |
| |
| When restarting a kthread, we eventually will use its stack, instead of the |
| current one, but we can't free the current stack until after we actually |
| longjmp() to it. This is the same problem as with the struct kthread dealloc. |
| So we can have the kthread (which we want to free later) hold on to the page we |
| wanted to dealloc. Likewise, when we would need a fresh kthread, we also need a |
| page to use as the default stacktop. So if we had a cached kthread, we then use |
| the page that kthread was pointing to. NOTE: the spare kthread struct is not |
| holding the stack it was originally saved with. Instead, it is saving the page |
| of the stack that was running when that kthread was reactivated. It's spare |
| storage for both the struct and the page, but they aren't linked in any |
| meaningful way (as in, the page being that kthread's stack). That linkage is |
| only true when a kthread is in use (like in a semaphore queue). |
| |
| Current and Process Contexts |
| ------------------------------- |
| When a kthread is suspended, should the core stay in process context (if it was |
| before)? Short answer: yes. |
| |
| For vcore local calls (process context, trapped on the calling core), we're |
| giving the core back, so we can avoid TLB shootdowns. Though we do have to |
| incref (which writes a cache line in the proc struct), since we are storing a |
| reference to the proc (and will try to load its cr3 later). While this sucks, |
| keep in mind this is for a blocking IO call (where we couldn't find the page in |
| any cache, etc). It might be a scalability bottleneck, but it also might not |
| matter in any real case. |
| |
| For async calls, it is less clear. We might want to keep processing that |
| process's syscalls, so it'd be easier to keep its cr3 loaded. Though it's not |
| as clear how we get from smp_idle() to a workable function and if it is useful |
| to be in process context until we start processing those functions again. Keep |
| in mind that normally, smp_idle() shouldn't be in any process's context. I'll |
| probably write something later that abandons any context before halting to make |
| sure processes die appropriately. But there are still some unresolved issues |
| that depend on what exactly we want to do. |
| |
| It is tempting to say we stay in process context if the call was local but not |
| if it was async, but there is an added complication. The function calling |
| sleep_on() doesn't care about whether it is on a process-allocated core or not. |
| This is solvable by using per_cpu_info(), and will probably work its way into a |
| future patch, regardless of whether or not we stay in process context for async |
| calls. |
| |
| As a final case, what will we do for processes that were interrupted by |
| something that wants to block, but wasn't servicing a syscall? We probably |
| shouldn't have these (I don't have a good example of when we'd want it, and a |
| bunch of reasons why we wouldn't), but if we do, then it might be okay anyway - |
| the kthread is just holding that proc alive for a bit. Page faults are a bit |
| different - they are something the process wants at least. I was thinking more |
| about unrelated async events. Still, shouldn't be a big deal. |
| |
| Kmsgs and Kthreads |
| ------------------------------- |
| Is there a way to mix kernel messages and kthreads? What's the difference, and |
| can one do the other? A kthread is a suspended call-stack and context (thread), |
| stopped in the middle of its work. Kernel messages are about starting fresh - |
| "hey core X, run this function." A kmsg can very easily be a tool used to |
| restart a kthread (either locally or on another core). We do this in the test |
| code, if you're curious how it could work. |
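| |
| Roughly, the test does something like this (a sketch from memory; the kmsg |
| handler signature is approximated): |
| |
|     /* Kmsg handler: starts fresh on whichever core it was sent to. */ |
|     static void __test_up_sem(uint32_t srcid, long a0, long a1, long a2) |
|     { |
|         struct semaphore *sem = (struct semaphore*)a0; |
| |
|         __up_sem(sem);  /* wakes the sleeper; we could instead kmsg the |
|                          * kthread to yet another core to restart it */ |
|     } |
| |
|     void test_kthreads(void) |
|     { |
|         static struct semaphore sem;    /* assume it starts with 0 signals */ |
| |
|         send_kernel_message(core_id() + 1, __test_up_sem, (long)&sem, 0, 0, |
|                             KMSG_ROUTINE); |
|         sleep_on(&sem);     /* we block; the handler's __up_sem restarts us */ |
|     } |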
| |
| Note we use the semaphore to deal with races. In test_kthreads(), we're |
| actually using the kmsg to up the semaphore. You just as easily could up the |
| semaphore in one core (possibly in response to a kmsg, though more likely due to |
| an interrupt), and then send the kthread to another core to restart via a kmsg. |
| |
| There's no reason you can't separate the __up_sem() and the running of the |
| kthread - the semaphore just protects you from missing the signal. Perhaps |
| you'll want to rerun the kthread on the physical core it was suspended on! |
| (cache locality, and it might be a legit option to allow processes to say it's |
| okay to take their vcore). Note this may require more bookkeeping in the struct |
| kthread. |
| |
| There is another complication: the way we've been talking about kmsgs (starting |
| fresh), we are talking about *routine* messages. One requirement for routine |
| messages that do not return is that they handle process state. The current |
| kmsgs, such as __death and __preempt, are built so they can handle acting on |
| whichever process is currently running. Likewise, __launch_kthread() needs to |
| handle the cases that arise when it runs on a core that was about to run a |
| process (as can often happen with proc_restartcore(), which calls |
| process_routine_kmsg()). Basically, if it was a _S, it just yields the process, |
| similar to what happens in Linux (call schedule() on the way out, from what I |
| recall). If it was a _M, things are a bit more complicated, since this should |
| only happen if the kthread is for that process (and probably a bunch of other |
| things - like they said it was okay to interrupt their vcore to finish the |
| syscall). Note - this might not be accurate anymore (see discussions on |
| current_ctx). |
| |
| To a certain extent, routine kmsgs don't seem like a nice fit, when we really |
| want to be calling schedule(). Though if you think of it as the enactment of a |
| previous scheduling decision (like other kmsgs (__death())), then it makes more |
| sense. The scheduling decision (as of now) was made in the interrupt handler |
| when it decided to send the kernel msg. In the future, we could split this into |
| having the handler make the kthread active, and have the scheduler called to |
| decide where and when to run the kthread. |
| |
| Current_ctx, Returning Twice, and Blocking |
| ------------------------------- |
| One of the reasons for decoupling kthreads from a vcore or the notion of a |
| kernel thread per user process/task is so that when the kernel blocks (on a |
| syscall or wherever), it can return to the process. This is the essence of the |
| asynchronous kernel/syscall interface (though it's not limited to syscalls |
| (pagefaults!!)). Here is what we want it to be able to handle: |
| - When a process traps (syscall, trap, or actual interrupt), the process regains |
| control when the kernel is done or when it blocks. |
| - Any kernel path can block at any time. |
| - Kernel control paths need to not "return twice", but we don't want to have to |
| go through acrobatics in the code to prevent this. |
| |
| There are a couple of approaches I considered, involving the nature of |
| "current_ctx" and a brutal bug. Current_ctx (formerly current_tf) is a |
| pointer to the trapframe of the process that was interrupted/trapped, and is |
| the user context that ought to run on this core if we return. Current_ctx is |
| 'made' when the kernel saves the context at the top of the interrupt stack (aka |
| 'stacktop'). Then the kernel's call path proceeds down the same stack. This |
| call path may get blocked in a kthread. When we block, we want to restart the |
| current_ctx. There is a coupling between the kthread's stack and the storage of |
| current_ctx (contents, not the pointer (which is in pcpui)). |
| |
| This coupling presents a problem when we are in userspace and get interrupted, |
| and that interrupt wants to restart a kthread. In this case, current_ctx points |
| to the interrupt stack, but then we want to switch to the kthread's stack. This |
| is okay. When that kthread wants to block again, it needs to switch back to |
| another stack. Up until this commit, it was jumping to the top of the old stack |
| it was on, clobbering current_ctx (took about 8-10 hours to figure this out). |
| While we could just make sure to save space for current_ctx, it doesn't solve |
| the problem: namely that the current_ctx concept is not bound to a specific |
| kernel stack (kthread or otherwise). We could have cases where more than one |
| kthread starts up on a core and we end up freeing the page that holds |
| current_ctx (since it is a stack we no longer need). We don't want to bother |
| keeping stacks around just to hold the current_ctx. Part of the nature of this |
| weird coupling is that a given kthread might or might not have the current_ctx |
| at the top of its stack. What a pain in the ass... |
| |
| The right answer is to decouple current_ctx from kthread stacks. There are two |
| ways to do this. In both ways, current_ctx retains its role of the context the |
| kernel restarts (or saves) when it goes back to a process, and is independent of |
| blocking kthreads. SPOILER: solution 1 is not the one I picked |
| |
| 1) All traps/interrupts come in on one stack per core. That stack never changes |
| (regardless of blocking), and current_ctx is stored at the top. Kthreads sort |
| of 'dispatch' / turn into threads from this event-like handling code. This |
| actually sounds really cool! |
| |
| 2) The contents of current_ctx get stored in per-cpu-info (pcpui), thereby |
| clearly decoupling it from any execution context. Any thread of execution can |
| block without any special treatment (though interrupt handlers shouldn't do |
| this). We handle the "returning twice" problem at the point of return. |
| |
| One nice thing about 1) is that it might make stack management easier (we |
| wouldn't need to keep a spare page, since it's the default core stack). 2) is |
| also tricky since we need to change some entry point code to write the TF to |
| pcpui (or at least copy-out for now). |
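| |
| For 2), the storage ends up looking something like this (a sketch; |
| ctx_on_entry_stack is made up): |
| |
|     struct per_cpu_info { |
|         /* ... other per-core state ... */ |
|         struct user_context actual_ctx; /* the contents live here, decoupled |
|                                          * from any kernel stack */ |
|         struct user_context *cur_ctx;   /* set iff a user context trapped in |
|                                          * on this core */ |
|     }; |
| |
|     /* In the entry point code, once the ctx is saved on the entry stack: */ |
|     pcpui->actual_ctx = *ctx_on_entry_stack;    /* the copy-out, for now */ |
|     pcpui->cur_ctx = &pcpui->actual_ctx; |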
| |
| The main problem with 1) is that you need to know and have code to handle when |
| you "become" a kthread and are allowed to block. It also prevents us making |
| changes such that all executing contexts are kthreads (which sort of is what is |
| going on, even if they don't have a struct yet). |
| |
| While considering 1), here's something I wanted to say: "every thread of |
| execution, including a KMSG, needs to always return (and thus not block), or |
| never return (and be allowed to block)." To "become" a kthread, we'd need to |
| have code that jumps stacks, and once it jumps it can never return. It would |
| have to go back to some place such as smp_idle(). |
| |
| Jumping stacks isn't a problem, and whatever we jump to would just have to |
| have smp_idle() at the end. The problem is that this is a pain in the ass to |
| work with in reality. But wait! Don't we do that with batched syscalls right |
| now? Yes (though we should be using kmsgs instead of the hacked together |
| workqueue spread across smp_idle() and syscall.c), and it is a pain in the ass. |
| It is doable with syscalls because we have that clearly defined point |
| (submitting vs processing). But what about other handlers, such as the page |
| fault handler? It could block, and lots of other handlers could block too. All |
| of those would need to have a jump point (in trap.c). We aren't even handling |
| events anymore, we are immediately jumping to other stacks, using our "event |
| handler" to hold current_ctx and handle how we return to current_ctx. Don't |
| forget about other code (like the boot code) that wants to block. Simply put, |
| option 1 creates a layer that is a pain to work with, cuts down on the |
| flexibility of the kernel to block when it wants, and doesn't handle the issue |
| at its source. |
| |
| The issue about having a defined point in the code that you can't return back |
| across (which is where 1 would jump stacks) is about "returning twice." Imagine |
| a syscall that doesn't block. It traps into the kernel, does its work, then |
| returns. Now imagine a syscall that blocks. Most of these calls are going to |
| block on occasion, but not always (imagine the read was filled from the page |
| cache). These calls really need to handle both situations. So in one instance, |
| the call blocks. Since we're async, we return to userspace early (pop the |
| current_ctx). Now, when that kthread unblocks, its code is going to want to |
| finish and unroll its stack, then pop back to userspace. This is the 'returning |
| twice' problem. Note that a *kthread* never returns twice. This is what makes |
| the idea of magic jumping points we can't return back across (and tying that to |
| how we block in the kernel) painful. |
| |
| The way I initially dealt with this was by always calling smp_idle(), and having |
| smp_idle decide what to do. I also used it as a place to dispatch batched |
| syscalls, which is what made smp_idle() more attractive. However, after a bit, |
| I realized the real nature of returning twice: current_ctx. If we forget about |
| the batching for a second, all we really need to do is not return twice. The |
| best place to do that is at the place where we consider returning to userspace: |
| proc_restartcore(). Instead of calling smp_idle() all the time (which was in |
| essence a "you can now block" point), and checking for current_ctx to return, |
| just check in restartcore to see if there is a tf to restart. If there isn't, |
| then we smp_idle(). And don't forget to handle the cases where we want to |
| start an scp_ctx (which we ought to point current_ctx to in proc_run()). |
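| |
| In sketch form (ignoring the kmsg and ownership details): |
| |
|     void proc_restartcore(void) |
|     { |
|         struct per_cpu_info *pcpui = &per_cpu_info[core_id()]; |
| |
|         process_routine_kmsg(); |
|         /* The "return twice" check: if there's no trapped context to |
|          * restart (say, a kthread already popped it before blocking), |
|          * there is nothing to return to. */ |
|         if (!pcpui->cur_ctx) |
|             smp_idle();                 /* never returns */ |
|         proc_pop_ctx(pcpui->cur_ctx);   /* to userspace; never returns */ |
|     } |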
| |
| As a side note, we ought to use kmsgs for batched syscalls - it will help with |
| preemption latencies. At least for all but the first syscall (which can be |
| called directly). Instead of sending a retval via current_ctx about how many |
| started, just put that info in the syscall struct's flags (which might help the |
| remote syscall case - no need for a response message, though there are still a |
| few differences (no failure model other than death)). |
| |
| Note that page faults will still be tricky, but at least now we don't have to |
| worry about points of no return. We just check if there is a current_ctx to |
| restart. The tricky part is communicating that the PF was sorted when there |
| wasn't an explicit syscall made. |
| |
| |
| Aborting Syscalls (2013-11-22) |
| ------------------------------- |
| On occasion, userspace would like to abort a syscall, specifically ones that |
| are listening on sockets/conversations where no one will ever answer. |
| |
| We have limited support for aborting syscalls. Kthreads that are in |
| rendez_sleep() (common for anything in the 9ns chunk of the kernel, which |
| includes any conversation listens) can be aborted. They'll return with an |
| error string to userspace. |
| |
| The easier part is the rules for kernel code to be abortable: |
| - Restore your invariants with waserror() before calling rendez_sleep(). |
| - That's really it. |
| So if you're holding a qlock, put your qunlock() code and any other unwinding |
| (such as a kfree()) in a waserror() catch. As it happens, it looks like plan9 |
| already did that (at least for the rendez in listen). And, as always, you |
| can't hold a spinlock when blocking, regardless of aborting calls or anything. |
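| |
| For example (my_dev and its fields are made up for illustration; the |
| waserror() machinery is the standard plan9 one): |
| |
|     struct my_dev { |
|         struct qlock qlock; |
|         struct rendez rendez; |
|         bool data_ready; |
|     }; |
| |
|     static int has_data(void *arg) |
|     { |
|         struct my_dev *dev = arg; |
| |
|         return dev->data_ready; |
|     } |
| |
|     void wait_for_data(struct my_dev *dev) |
|     { |
|         qlock(&dev->qlock); |
|         if (waserror()) { |
|             /* an abort error()s out of rendez_sleep(); restore our |
|              * invariants, then keep the error rolling */ |
|             qunlock(&dev->qlock); |
|             nexterror(); |
|         } |
|         rendez_sleep(&dev->rendez, has_data, dev); |
|         qunlock(&dev->qlock); |
|         poperror(); |
|     } |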
| |
| I don't want arbitrary sleeps to be abortable. For instance, if a kthread is |
| waiting on an arbitrary semaphore/qlock, we won't allow an abort. The |
| reasoning is that the kthread will eventually acquire the qlock - we're not |
| waiting on external sources to wake up. That's not 100% true - a kthread |
| could be blocked on a qlock, and the qlock holder could be abortable. In the |
| future, we could build some sort of "abort inheritance", usable by root or |
| something (danger of aborting another process's kthread). Alternatively, we |
| could make qlocks abortable too, though that would require all qlocking code |
| to be unwindable. |
| |
| The harder part to syscall aborting is safely waking a kthread. There are |
| several layers to go through from uthread or syscall down to the condition |
| variable a kthread is sleeping on. Given a uthread, find its syscall. Given |
| a syscall, find its kthread. Given the kthread, find the CV. And during all |
| of these, syscalls complete concurrently, kthreads get repurposed for other |
| syscalls, CVs could be freed (though that doesn't happen). Syscalls are often |
| on stacks, so when they complete, the memory is both gibberish and potentially |
| in use. |
| |
| Ultimately, I decided on a system of "safe abort attempts", where it is |
| harmless to be wrong with an attempt. Instead of dealing with the races |
| associated with memory freeing and syscalls completing, aborts only proceed |
| when it is safe to do so (we look up by pointer, and only dereference if the |
| lookup succeeds). |
| |
| As it stands now, all abortable kthreads/sleepers/syscalls are on a per-proc |
| list, and we can lookup by struct syscall*. They are only on the list when |
| they are abortable (the CV can be poked), and the invariant is that when they |
| are on the list, they are in a state that can be safely aborted: the kthread |
| is working on the syscall, it hasn't unwound, it is still in rendez_sleep(), |
| the CV is safe, etc. The details of this protection are sorted out with |
| __reg_abortable_cv() and dereg_abortable_cv() (since it's really the condition |
| variable that we're trying to find). So from the kernel side, nothing bad can |
| happen if you ask to abort an arbitrary struct syscall*. |
| |
| The actual abort uses the "write/signal, then wake" method. The aborter |
| tracks down the kthread via the lookup, the success of which guarantees the |
| sleeper is in rendez_sleep() (or similar sleep paths), marks "SC_ABORT", |
| (barriers), and attempts to wake the kthread (cv_broadcast, since we need to |
| be sure we woke the right kthread). |
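| |
| In rough code form (lookup_abortable() and release_cle() are made up names |
| for the list-protection described above): |
| |
|     bool abort_sysc(struct proc *p, struct syscall *sysc) |
|     { |
|         struct cv_lookup_elm *cle; |
| |
|         /* Lookup via pointer: only succeeds if sysc is currently on the |
|          * per-proc abortable list, which guarantees the sleeper is in an |
|          * abortable state and keeps the cle (and its CV) alive. */ |
|         cle = lookup_abortable(p, sysc); |
|         if (!cle) |
|             return FALSE;   /* not abortable right now; caller can retry */ |
|         atomic_or(&sysc->flags, SC_ABORT); |
|         wmb();                  /* flag must be visible before the wakeup */ |
|         cv_broadcast(cle->cv);  /* broadcast, to be sure our kthread wakes */ |
|         release_cle(cle);       /* lets a spinning dereg_ proceed */ |
|         return TRUE; |
|     } |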
| |
| On the user side, we set an alarm to run an event handler that will cancel our |
| syscall. The alarm stuff is fairly standard (runs in vcore context). |
| Userspace no longer has the concern of syscalls completing while they abort, |
| since the kernel will only abort syscalls that are abortable. However, it may |
| have issues (in theory) with aborting future syscalls. If the alarm goes off |
| when the uthread is in another later syscall (which may happen to use the same |
| struct syscall*), then we could accidentally abort the wrong call; the abort |
| alarm is implicitly tied to a particular syscall instance in time. This is |
| relatively easy to handle: just turn off the alarm before reusing that syscall |
| struct for a syscall. This relies on a property of the alarms: that when |
| deregistering completes, the alarm handler will not be running concurrently. |
| Incidentally, there is *another* minor trick here: when adjusting the alarm, |
| the uthread will issue a syscall, possibly reusing its old sysc*, but that |
| happens *after* deregistering its original alarm - the point past which we |
| can no longer accidentally cancel an arbitrary syscall. Also note that the |
| call to change the kernel alarm wouldn't actually block and become abortable, |
| but regardless, we're safe. |
| |
| There are a couple downsides to the "safe abort attempts" approach. We can |
| only abort syscalls when they are at a certain point - if they aren't |
| currently sleeping, the call will fail. Technically, the abort could take |
| effect later on in the life of a syscall (the aborter flags the kthread to |
| abort concurrent with the kthread waking up naturally, and then the call |
| aborts on the next rendez_sleep that needs to block). Related to this |
| limitation, userspace must keep attempting to cancel a syscall until it |
| succeeds. It may also be told an abort succeeded, even if the call actually |
| completes (the aborter flags the kthread, the rendez wakes naturally, and the |
| kthread never blocks again). Ultimately, we can't "fire and forget" our abort |
| attempt. It's not a huge problem though: my older approaches avoided this |
| particular issue, but had worse problems. |
| |
| For instance, the original idea I had was for userspace to flag the syscall |
| (flags |= SC_ABORT). It could do this at any time. Whenever the kthread was |
| going to block in an abortable location (e.g. rendez_sleep()), it would see |
| the flag and abort. It might already be asleep, so we would also provide a |
| syscall that would 'kick' the kthread responsible for some sysc*, to wake it |
| up to see the flag and abort. The first problem was writing to the sysc |
| flags. Unless we know the memory is actually the syscall we want, this could |
| result in randomly writing to memory (such as a uthread's stack). I ran into |
| similar issues in the kernel: you can't touch a kthread struct unless you know |
| it is the kthread you want. |
| |
| Once I started dealing with the syscall -> kthread mapping, it became clear |
| I'd need a per-proc lookup service in the kernel, which acts as a way to lock |
| a reference to the kthread. I could solve the 'kthread memory safety' problem |
| by looking up by reference, similar to how pid2proc works. Analogously, by |
| changing the interface for sys_abort_syscall() to be a "lookup" approach, I |
| solve the struct syscall * memory problem. |
| |
| As a smaller note, I considered registering every kthread with the process |
| right away (specifically, when we link the syscall to the kthread->sysc) for |
| the sysc->kthread lookup service. But this would get expensive, since every |
| syscall pays the lookup tax (and we'd need to worry about scaling). We want |
| syscalls to be fast, but the infrequent aborts can be expensive. The obvious |
| change was to only save the abortable kthreads. The tradeoff is that we can't |
| flag syscalls for aborting unless they are in an abortable state. This |
| requires multiple pokes by userspace. In theory, they would have to deal with |
| that scenario anyways (in case they attempt to abort before we even register |
| in the first place). |
| |
| As another side note, if userspace ever has a struct syscall allocator, for |
| use in async (non-uthread stack-based) syscalls, we'll need to not reuse a |
| syscall struct until after the cancel alarm has been disarmed. Right now we |
| do this by not having the uthread issue another syscall til after the disarm, |
| since uthread stack-based syscalls are inherently bound to the uthread. A |
| simple solution would be to have a per-uthread syscall struct, which that |
| uthread uses preferentially, and the sysc is only freed when the uthread is |
| freed. Not only would this scale better than accessing the sysc allocator for |
| every syscall, but also there is no worry of reuse til the uthread disarms and |
| exits. |
| |
| It is a userspace bug for a uthread to set the alarm and not unset it before |
| either making a syscall or exiting. The root issue of that potential bug is |
| that someone (alarm handler) holds a pointer to a uthread, with the intent of |
| cancelling its syscall, and we need to somehow take back that pointer (cancel |
| the alarm) before reusing the syscall or freeing the uthread. I considered |
| not making the alarm guarantee that when the cancel returns, the handler isn't |
| running concurrently. We could handle the races in the alarm handler and in |
| the cancel code, but it's an added hassle that isn't clearly needed. This |
| does mean we have to run the alarm handlers serially, while holding the alarm |
| lock. I'm fine with this, for now. Perhaps if users want more concurrency, |
| their handlers can spawn or wake up a uthread. |
| |
| It is also worth noting that many rendez_sleep() calls actually return right |
| away. This is common if some data is already in the queue (or whatever the |
| condition is that we want to conditionally sleep on). Since registration is a |
| little bit heavier than just locking the CV, I use the classic "check, signal, |
| check again" style, where we check cond, then register, and then check cond |
| for real. The initial check is the optimization, while the "signal, then |
| check" is the true synchronization. I use this style all over the place |
| (check out the event delivery with concurrent vcore yields code). |
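| |
| In rendez_sleep(), the pattern looks roughly like this (simplified; the abort |
| check and the error unwinding are elided): |
| |
|     void rendez_sleep(struct rendez *rv, int (*cond)(void*), void *arg) |
|     { |
|         struct cv_lookup_elm cle; |
| |
|         if (cond(arg)) |
|             return;     /* the optimization: the data was already there */ |
|         cv_lock(&rv->cv); |
|         __reg_abortable_cv(&cle, &rv->cv);  /* the "signal": aborters can |
|                                              * now find us */ |
|         while (!cond(arg))                  /* the "check again" */ |
|             cv_wait(&rv->cv);   /* also where SC_ABORT gets noticed */ |
|         cv_unlock(&rv->cv); |
|         dereg_abortable_cv(&cle);   /* without the CV lock, per lock order */ |
|     } |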
| |
| Because of this optimization, we have a slightly odd interface: __reg is |
| called with the CV lock held, and dereg_ is not. There are some lock ordering |
| issues. Without the optimization, we could simply make the order {list lock, |
| CV lock}, so that the aborter can use the list lock to keep a kthread/cv alive |
| (one of the struct cv_lookup_elm in the code, to be precise) while it |
| cv_broadcasts. However, the "check first" optimization would need to lock and |
| unlock the CV a couple times, which seems excessive. So we switch the lock |
| order to {CV, list lock}, and the aborter doesn't hold the list lock while |
| signalling the CV. Instead, it keeps the cle alive with a flag that dereg_ |
| spins on. This spinwait is why dereg can't hold the CV lock: it would create |
| a circular dependency. |