| FD Taps | 
 | =========================== | 
 | 2015-07-27 Barret Rhoden (brho) | 
 |  | 
 | Contents | 
 | --------------------------- | 
 | What are FD Taps? | 
 | Where are the FD Taps? | 
 |  | 
 |  | 
 | What are FD Taps? | 
 | --------------------------- | 
 |  | 
 | Where are the FD Taps? | 
 | --------------------------- | 
 | ### Basics ### | 
 | In Linux, the epoll blob is attached to the File (I think, this is the struct | 
 | eventpoll).  Linux can get from a sock -> socket -> file -> eventpoll.  From the | 
 | lower levels of the networking stack, you can get all the way to the accounting | 
 | info for epoll. | 
 |  | 
 | In Akaros, and in Plan 9, the analogous object to the file is the chan. | 
 | However, in the networking stack, the conversation (like a struct sock) does not | 
 | keep a pointer to its chan.  Further, there is not a 1:1 correspondence between | 
 | convs and chans: there could be several chans using the same conv, similar to | 
 | using several OS files for the same underlying disk file (inode).  Although that | 
 | might be a bad idea for network connections, it'd be nice to not have FD Taps | 
 | assume anything about the underlying device.  So for Akaros, we want to have the | 
 | tap somewhere within the device.  For #I, that probably means hanging off the | 
 | conversation.  For #M (devmnt), it would be some other struct, where the tap is | 
 | translated into a 9p message. | 
 |  | 
 | Another aspect of this issue is that these are "FD" taps, not "file/chan" taps. | 
 | If you read through the Q&A for epoll's man page, there are a bunch of weird | 
 | conditions that result from having the tap on the file.  This is due to having | 
 | multiple FDs point to the same file. | 
 |  | 
 | The approach I took in Akaros was to have the tap in both the FD and within the | 
 | device (the conversation).  If we're declaring interest in an FD, the FD is a | 
 | reasonable place to track that interest.  We also need to track the tap within | 
 | the device, as mentioned above.  Now we need to sort out the registration of | 
 | taps and avoid any concurrency issues. | 
 |  | 
 | ### Code Issues ### | 
 | We need to worry about a few things.  Overall, we want to register a tap on an | 
 | FD (struct file_desc), and that registration needs to go through the device. | 
 | Perhaps the device doesn't support taps, or it doesn't support the event filters | 
 | we requested.  So we need to handle registration failure.  We also need to | 
 | handle concurrent deregistrations, re-registrations, opens, and closes. | 
 |  | 
 | A basic approach would be to lock the FD table, make sure there's only one tap, | 
 | register the new one with the device, insert into the table, and unlock.  The | 
 | lock serializes adding the tap (there can be only one, and adders race on the | 
 | FD's tap pointer), serializes concurrent tap removals, ensures the FD still | 
 | points to a file, and protects against FD closes. | 
 |  | 
 | But the problem is the FD table lock is a spinlock, and we don't want it to be | 
 | more than that.  Device registration could be a blocking call.  So we need to | 
 | come up with something else.  Part of the problem involves syncing with two | 
 | places: the FD and the conv. | 
 |  | 
 | At this point I thought about putting the tap in the device, and not the FD at | 
 | all.  Deregistration becomes tricky.  We want to destroy the tap when the FD | 
 | closes, or at least turn it off.  Say we do something like "after closing, | 
 | deregister the tap".  We could pass enough info to the device to make it | 
 | work - we'd probably want to pass in the FD (integer), proc*, and probably the | 
 | chan.  However, once we closed, the FD is now free, and we could have something | 
 | like: | 
 | 	Trying to close:	User opens and taps a conv: | 
 | 	close(5) (FD 5 was 1/data with a tap) | 
 | 				open(/net/tcp/1/data) (get 5 back) | 
 | 				register_fd_tap(5) (two taps on 5, might fail!) | 
 | 	deregister_fd_tap(5) | 
 | 	cclose (needed to keep the chan alive) | 
 | At the end, we might have no taps on 5.  Or if we opened 2/data instead of | 
 | 1/data, the deregister_fd_tap call will accidentally deregister from the new FD | 
 | 5 instead of the old one, and the old one will still be active! | 
 |  | 
 | Maybe we deregister first, then close, to avoid FD reuse problems.  Remember | 
 | that the only locking goes on in close.  Now consider: | 
 | 	Trying to close:	User tries to add (another) tap: | 
 | 	deregister_fd_tap(5) | 
 | 				register_fd_tap(5) | 
 | 	close(5) (was 1/data with a tap) | 
 | Now we just closed with a tap still registered.  Eventually, that FD tap might | 
 | fire.  Spurious events are okay, but we could run into issues.  Say the evq in | 
 | the original tap is no longer valid.  It was buggy for the user to perform this | 
 | operation, but there are probably other issues.  And we didn't even get into | 
 | how registration works (register before putting it in the FD table?  After? | 
 | What about concurrent ops?) | 
 |  | 
 | We could flag the FD as 'untappable'.  But it seems that we're going to need to | 
 | sync with the FD table regardless of where the tap exists.  We might as well go | 
 | back to the original plan of having the tap hang off the FD in some manner.  It | 
 | makes the most sense, aesthetically, since the FD tap is an attribute of the FD. | 
 |  | 
 | One trick that would help with FD reuse is to have the device op for | 
 | register/deregister take the fd_tap pointer.  That lets us squeeze more info | 
 | into the tap without mucking with the function signature, but the main benefit | 
 | is that so long as the FD tap is allocated, its address is unique.  FD = 5 can | 
 | be reused.  FD_tap = 0xffff800012345678 is unique. | 
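
The FD reuse point can be illustrated with a small userspace sketch.  This is
not the Akaros device op: struct toy_conv, the toy_* helpers, and the fd_tap
fields shown here are all hypothetical, chosen only to show that a device
which tracks taps by pointer can tell apart two taps that carry the same FD
number.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tap layout; the fields are illustrative only. */
struct fd_tap {
	int fd;		/* FD number when the tap was made; the number can be reused */
	int filter;	/* requested event filter bits */
};

#define TOY_MAX_TAPS 4

/* Toy stand-in for a conversation: it tracks taps by pointer, not by FD. */
struct toy_conv {
	struct fd_tap *taps[TOY_MAX_TAPS];
};

/* Returns 0 on success, -1 if the conv has no room for another tap. */
static int toy_register_tap(struct toy_conv *conv, struct fd_tap *tap)
{
	assert(tap != NULL);
	for (size_t i = 0; i < TOY_MAX_TAPS; i++) {
		if (!conv->taps[i]) {
			conv->taps[i] = tap;
			return 0;
		}
	}
	return -1;
}

/* Returns 1 if this exact tap (by address) is registered, 0 otherwise. */
static int toy_is_registered(struct toy_conv *conv, struct fd_tap *tap)
{
	for (size_t i = 0; i < TOY_MAX_TAPS; i++) {
		if (conv->taps[i] == tap)
			return 1;
	}
	return 0;
}

/* Returns 0 if this exact tap was registered, -1 otherwise.  A tap with the
 * same FD number but a different address is left alone. */
static int toy_deregister_tap(struct toy_conv *conv, struct fd_tap *tap)
{
	for (size_t i = 0; i < TOY_MAX_TAPS; i++) {
		if (conv->taps[i] == tap) {
			conv->taps[i] = NULL;
			return 0;
		}
	}
	return -1;
}
```

With two taps that both say fd = 5 (an old one from before the close and a new
one from after the reopen), deregistering the old pointer removes only the old
tap, which is the property the integer FD alone could not give us.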
 |  | 
 | However, simply adding the tap pointer to register() isn't enough.  Say we did | 
 | the basic "lock the FD table, (basic checks), attach the pointer, unlock, call | 
 | device register, then free it if register fails", and a dereg locks the table, | 
 | yanks it out, then calls device dereg, then frees.  We still have some issues: | 
 |  | 
 | - What if a deregister occurs while a register is still in progress, and that | 
 |   register then fails?  Who actually frees the FD tap?  We can't completely | 
 |   free it while the other op is in progress.  That sounds like a job for a kref | 
 |   on the FD tap. | 
 |  | 
 | - What if we add the tap, go to register, and the register fails while a | 
 |   concurrent close tries to deregister it?  Now we have concurrent deregisters. | 
 |   We can deal with this by having the device op accept spurious deregisters, but | 
 |   that's ugly (and unnecessary, see below). | 
 |  | 
 | - What if a legit deregister occurs while we are registering and eventually will | 
 |   succeed?  Say: | 
 | 						sys_register_fd_tap(0xf00) | 
 | 						adds to fdset, unlocks | 
 | 	close(5) | 
 | 	yanks 0xf00 from the fd | 
 | 	deregister tap 0xf00 (fails, spurious) | 
 | 						register tap(chan, 0xf00) | 
 | 	free 0xf00? | 
 | The deregister fails, since it was never there (remember we said it could have | 
 | spurious deregister calls).  Then register happens.  But the FD is closed!  And | 
 | then who is freeing the tap?  Hopefully we don't free it while the device still | 
 | has a pointer... | 
 |  | 
 | The issue here is the assumption that the tap would have been registered.  Since | 
 | we unlock the FD table, we can violate those assumptions.  We want to guarantee | 
 | the order of register/deregister operations, such that register happens before | 
 | deregister. | 
 |  | 
 | It turns out that the kref can do this too!  The trick is to use the release | 
 | operation to do the deregistration.  That ensures that so long as a reference is | 
 | held, we won't call deregister *and* that deregister will happen exactly once. | 
 | close() simply becomes "lock the FDT, remove the tap, unlock, decref": extremely | 
 | simple.  Note that decref could trigger the release method which could then | 
 | sleep (since it calls into a device), so we decref outside the lock.  register() | 
 | ups the refcnt by two, one for itself to keep the tap alive (and prevent a | 
 | concurrent dereg) and one for the pointer in the FD table. | 
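
Here is a minimal userspace sketch of the release-does-the-deregister idea.
Everything in it is an assumption made for illustration: the tiny struct kref
is a stand-in built on C11 atomics, the FD table is a bare array with the
locking reduced to comments, and the "device" is a pair of counters.  Register
is split into two steps so that a close can be slotted in between them.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Minimal kref stand-in: the last put runs the release method. */
struct kref {
	atomic_int refcnt;
	void (*release)(struct kref *kref);
};

static void kref_init(struct kref *k, void (*release)(struct kref *), int refs)
{
	atomic_init(&k->refcnt, refs);
	k->release = release;
}

static void kref_put(struct kref *k)
{
	/* fetch_sub returns the old value: 1 means we dropped the last ref */
	if (atomic_fetch_sub(&k->refcnt, 1) == 1)
		k->release(k);
}

struct fd_tap {
	struct kref kref;
	int fd;
};

#define NR_FDS 8
static struct fd_tap *fd_taps[NR_FDS];	/* toy FD table: one tap slot per FD */
static int nr_dev_register;		/* toy device: register calls seen */
static int nr_dev_deregister;		/* toy device: deregister calls seen */
static int dereg_before_reg;		/* set if dereg ever precedes register */

/* Release method: the only place that deregisters, so it runs exactly once. */
static void tap_release(struct kref *kref)
{
	struct fd_tap *tap = (struct fd_tap *)((char *)kref -
	                                       offsetof(struct fd_tap, kref));
	if (nr_dev_register == 0)
		dereg_before_reg = 1;
	nr_dev_deregister++;	/* a real device op could block here */
	free(tap);
}

/* Register, step 1: "lock the FDT", attach the tap, "unlock".  Two refs: one
 * for the registrant, one for the pointer in the FD table. */
static struct fd_tap *tap_attach(int fd)
{
	struct fd_tap *tap;

	assert(fd >= 0 && fd < NR_FDS);
	if (fd_taps[fd])
		return NULL;	/* only one tap per FD */
	tap = malloc(sizeof(*tap));
	if (!tap)
		return NULL;
	kref_init(&tap->kref, tap_release, 2);
	tap->fd = fd;
	fd_taps[fd] = tap;
	return tap;
}

/* Register, step 2: call the device outside the lock, then drop our ref. */
static void tap_finish_register(struct fd_tap *tap)
{
	nr_dev_register++;
	kref_put(&tap->kref);
}

/* close(): "lock the FDT", remove the tap, "unlock", then decref. */
static void fd_close(int fd)
{
	struct fd_tap *tap;

	assert(fd >= 0 && fd < NR_FDS);
	tap = fd_taps[fd];
	fd_taps[fd] = NULL;
	if (tap)
		kref_put(&tap->kref);
}
```

Calling tap_attach(5), then fd_close(5), then tap_finish_register() replays the
problem interleaving from earlier: the close drops only the table's reference,
so the deregister is deferred until the registrant's decref, after the device
register has happened.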
 |  | 
 | Note that as soon as we unlock, our tap could be decref'd and a completely new | 
 | tap could be added and registered for that FD.  That means the following can | 
 | happen: | 
 | 	lock FDT | 
 | 	add tap 0xf00 to FD 5 | 
 | 	unlock FDT | 
 | 						lock FDT | 
 | 						remove tap from FD 5 | 
 | 						unlock FDT | 
 | 						decref 0xf00 | 
 | 						(new syscall) | 
 | 						lock FDT | 
 | 						add tap 0xbar to FD 5 | 
 | 						unlock FDT | 
 | 						register tap 0xbar for FD 5 | 
 | 	register tap 0xf00 for FD 5 | 
 | 	decref and trigger a deregister of f00 | 
 |  | 
 | In this case the device could see two separate taps (0xf00 and 0xbar) for the | 
 | same FD (5).  It just so happens that one of them will deregister soon.  It is | 
 | also possible for an event to fire between the left column's register and | 
 | decref, at which point two events would be created (possibly with the same evq | 
 | and event id). | 
 |  | 
 | The final case to consider is when registration fails.  To keep things simple | 
 | for the device, we can make sure that we only deregister a tap if our register | 
 | succeeded.  To do this nicely with krefs, we can simply change the release | 
 | method, based on whether or not registration succeeds. |
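
One way to picture swapping the release method is the sketch below.  It is a
simplified illustration, not the kernel code: struct tap, the two release
functions, and toy_dev_register (with its should_fail knob) are hypothetical
names, the refcount is embedded directly in the tap, and the FD table is left
out, so the caller's tap_put() stands in for whoever drops the table's
reference (close, or the cleanup after a failed register).

```c
#include <assert.h>
#include <stdatomic.h>

struct tap {
	atomic_int refcnt;
	void (*release)(struct tap *tap);
	int dev_registered;	/* toy device state: does the device know us? */
	int dereg_calls;	/* how many times the device saw a deregister */
	int released;		/* set when the tap would have been freed */
};

/* Used when the device never accepted the tap: just free it. */
static void release_free_only(struct tap *tap)
{
	tap->released = 1;
}

/* Used when registration succeeded: deregister, then free. */
static void release_dereg_and_free(struct tap *tap)
{
	tap->dereg_calls++;
	tap->dev_registered = 0;
	release_free_only(tap);
}

static void tap_put(struct tap *tap)
{
	int old = atomic_fetch_sub(&tap->refcnt, 1);

	assert(old > 0);
	if (old == 1)
		tap->release(tap);
}

/* Toy device op; should_fail simulates an unsupported tap or filter. */
static int toy_dev_register(struct tap *tap, int should_fail)
{
	if (should_fail)
		return -1;
	tap->dev_registered = 1;
	return 0;
}

/* Returns the device's result.  On return, one reference remains: the one
 * that belongs to the FD table. */
static int tap_register(struct tap *tap, int should_fail)
{
	int ret;

	atomic_init(&tap->refcnt, 2);
	tap->release = release_dereg_and_free;
	tap->dev_registered = 0;
	tap->dereg_calls = 0;
	tap->released = 0;
	/* ... attach to the FD table under the lock, then unlock ... */
	ret = toy_dev_register(tap, should_fail);
	if (ret)
		tap->release = release_free_only;	/* device never saw it */
	tap_put(tap);	/* drop the registrant's own reference */
	return ret;
}
```

The release pointer is changed while the registrant still holds its reference,
so whichever decref ends up being the last one runs the method that matches
what the device actually saw.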