| FD Taps |
| =========================== |
| 2015-07-27 Barret Rhoden (brho) |
| |
| Contents |
| --------------------------- |
| What are FD Taps? |
| Where are the FD Taps? |
| |
| |
| What are FD Taps? |
| --------------------------- |
| |
| Where are the FD Taps? |
| --------------------------- |
| ### Basics ### |
| In Linux, the epoll blob is attached to the File (I think, this is the struct |
| eventpoll). Linux can get from a sock -> socket -> file -> eventpoll. From the |
| lower levels of the networking stack, you can get all the way to the accounting |
| info for epoll. |
| |
| In Akaros, and in Plan 9, the analogous object to the file is the chan. |
| However, in the networking stack, the conversation (like a struct sock) does not |
| keep a pointer to its chan. Further, there is not a 1:1 correspondence between |
| convs and chans: there could be several chans using the same conv, similar to |
| using several OS files for the same underlying disk file (inode). Although that |
| might be a bad idea for network connections, it'd be nice to not have FD Taps |
| assume anything about the underlying device. So for Akaros, we want to have the |
| tap somewhere within the device. For #I, that probably means hanging off the |
| conversation. For #M (devmnt), it would be some other struct, where the tap is |
| translated into a 9p message. |
| |
| Another aspect of this issue is that these are "FD" taps, not "file/chan" taps. |
| If you read through the Q&A for epoll's man page, there are a bunch of weird |
| conditions that result from having the tap on the file. This is due to having |
| multiple FDs point to the same file. |
| |
| The approach I took in Akaros was to have the tap in both the FD and within the |
| device (the conversation). If we're declaring interest in an FD, the FD is a |
| reasonable place to track that interest. We also need to track the tap within |
| the device, as mentioned above. Now we need to sort out the registration of |
| taps and avoid any concurrency issues. |
| |
| ### Code Issues ### |
| We need to worry about a few things. Overall, we want to register a tap on an |
| FD (struct file_desc), and that registration needs to go through the device. |
| Perhaps the device doesn't support taps, or it doesn't support the event filters |
| we requested. So we need to handle registration failure. We also need to |
| handle concurrent deregistrations, re-registrations, opens, and closes. |
| |
| A basic approach would be to lock the FD table, make sure there's only one tap, |
| register the new one with the device, insert into the table, and unlock. The |
| lock serializes adding the tap (an FD can have only one, so adders race on the |
| FD's tap pointer) against concurrent tap removals, ensures the FD points to a |
| file, and protects against FD closes. |
| |
| But the problem is the FD table lock is a spinlock, and we don't want it to be |
| more than that. Device registration could be a blocking call. So we need to |
| come up with something else. Part of the problem involves syncing with two |
| places: the FD and the conv. |
| |
| At this point I thought about putting the tap in the device, and not the FD at |
| all. Deregistration becomes tricky. We want to destroy the tap when the FD |
| closes, or at least turn it off. Say we do something like "after closing, |
| deregister the tap". We could pass enough info to the device to make it |
| work - we'd probably want to pass in the FD (integer), proc*, and probably the |
| chan. However, once we closed, the FD is now free, and we could have something |
| like: |
|     Trying to close:                        User opens and taps a conv: |
|     close(5) (FD 5 was 1/data with a tap) |
|                                             open(/net/tcp/1/data) (get 5 back) |
|                                             register_fd_tap(5) (two taps on 5, might fail!) |
|     deregister_fd_tap(5) |
|     cclose (needed to keep the chan alive) |
| At the end, we might have no taps on 5. Or if we opened 2/data instead of |
| 1/data, the deregister_fd_tap call will accidentally deregister from the new FD |
| 5 instead of the old one, and the old one will still be active! |
| |
| Maybe we deregister first, then close, to avoid FD reuse problems. Remember |
| that the only locking goes on in close. Now consider: |
|     Trying to close:                        User tries to add (another) tap: |
|     deregister_fd_tap(5) |
|                                             register_fd_tap(5) |
|     close(5) (was 1/data with a tap) |
| Now we just closed with a tap still registered. Eventually, that FD tap might |
| fire. Spurious events are okay, but we could run into issues. Say the evq in |
| the original tap is no longer valid. It was buggy for the user to perform this |
| operation, but there are probably other issues. And we didn't even get into |
| how registration works (register before putting it in the FD table? After? |
| What about concurrent ops?) |
| |
| We could flag the FD as 'untappable'. But it seems that we're going to need to |
| sync with the FD table regardless of where the tap exists. We might as well go |
| back to the original plan of having the tap hang off the FD in some manner. It |
| makes the most sense, aesthetically, since the FD tap is an attribute of the FD. |
| |
| One trick that would help with FD reuse is to have the device op for |
| register/deregister take the fd_tap pointer. Not only can we squeeze more info |
| in the tap without mucking with the function signature, but the main benefit is |
| that so long as the FD tap is allocated, it is unique. FD = 5 can be reused. |
| FD_tap = 0xffff800012345678 is unique. |
| |
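To make the pointer-as-identity idea concrete, here is a small user-space sketch in C. It is only an illustration: the struct layouts, the fixed-size per-device list, and the names dev_register_tap, dev_deregister_tap, and dev_count_taps_for_fd are assumptions made up for this example, not the real device op. The point is that the device matches on the tap's address, so a second tap that happens to carry the same FD number is never touched by a stale deregister.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tap: only the fields this sketch needs. */
struct fd_tap {
	int fd;		/* FD number; can be reused after a close */
	int filter;	/* requested event filter bits */
};

#define DEV_MAX_TAPS 8

/* Hypothetical per-device (e.g. per-conversation) tap list. */
struct dev_taps {
	struct fd_tap *taps[DEV_MAX_TAPS];
};

/* Register by pointer.  Returns 0 on success, -1 if the list is full. */
static int dev_register_tap(struct dev_taps *dev, struct fd_tap *tap)
{
	for (size_t i = 0; i < DEV_MAX_TAPS; i++) {
		if (!dev->taps[i]) {
			dev->taps[i] = tap;
			return 0;
		}
	}
	return -1;
}

/* Deregister by pointer.  Returns 0 if found, -1 otherwise.  Another tap
 * that happens to carry the same FD number is left alone. */
static int dev_deregister_tap(struct dev_taps *dev, struct fd_tap *tap)
{
	for (size_t i = 0; i < DEV_MAX_TAPS; i++) {
		if (dev->taps[i] == tap) {
			dev->taps[i] = NULL;
			return 0;
		}
	}
	return -1;
}

/* Count the taps the device currently holds for a given FD number. */
static int dev_count_taps_for_fd(const struct dev_taps *dev, int fd)
{
	int n = 0;

	for (size_t i = 0; i < DEV_MAX_TAPS; i++) {
		if (dev->taps[i] && dev->taps[i]->fd == fd)
			n++;
	}
	return n;
}
```

With two taps that both say FD 5, deregistering the old pointer removes only the old tap. A second deregister of the same pointer simply reports "not found".
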
| However, simply adding the tap pointer to register() isn't enough. Say we did |
| the basic "lock the FD table, (basic checks), attach the pointer, unlock, call |
| device register, then free it if register fails", and a dereg locks the table, |
| yanks it out, calls device dereg, then frees. We still have some issues: |
| |
| - What if a deregister occurs while we are still trying to register and failed? |
| Who actually frees the FD tap? We can't completely free it while the other op |
| is in progress. That sounds like a job for a kref on the FD tap. |
| |
| - What if we add the tap, go to register it, and registration fails, and then a |
| concurrent close tries to deregister it? Now we have concurrent deregisters. |
| We can deal with this by having the device op accept spurious deregisters, but |
| that's ugly (and unnecessary, see below). |
| |
| - What if a legit deregister occurs while we are registering and eventually will |
| succeed? Say: |
|     sys_register_fd_tap(0xf00) |
|     adds to fdset, unlocks |
|                                             close(5) |
|                                             yanks 0xf00 from the fd |
|                                             deregister tap 0xf00 (fails, spurious) |
|     register tap(chan, 0xf00) |
|                                             free 0xf00? |
| The deregister fails, since it was never there (remember we said it could have |
| spurious deregister calls). Then register happens. But the FD is closed! And |
| then who is freeing the tap? Hopefully we don't free it while the device still |
| has a pointer... |
| |
| The issue here is the assumption that the tap would have been registered. Since |
| we unlock the FD table, we can violate those assumptions. We want to guarantee |
| the order of register/deregister operations, such that register happens before |
| deregister. |
| |
| It turns out that the kref can do this too! The trick is to use the release |
| operation to do the deregistration. That ensures that so long as a reference is |
| held, we won't call deregister *and* that deregister will happen exactly once. |
| close() simply becomes "lock the FDT, remove the tap, unlock, decref": extremely |
| simple. Note that decref could trigger the release method which could then |
| sleep (since it calls into a device), so we decref outside the lock. register() |
| ups the refcnt by two, one for itself to keep the tap alive (and prevent a |
| concurrent dereg) and one for the pointer in the FD table. |
| |
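The kref scheme can be modeled in a few lines of user-space C. This is a sketch under stated assumptions, not the kernel code: struct kref, kref_put, the fd_taps array, the no-op fdt_lock/fdt_unlock, and the split of register into tap_attach and tap_register_with_dev are all hypothetical stand-ins, and the "device" is just a flag and a counter on the tap. Splitting register in two makes it possible to replay a close that lands between the FD table unlock and the device call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins; not the Akaros definitions. */
struct kref {
	int refcnt;
	void (*release)(struct kref *kref);
};

struct fd_tap {
	struct kref kref;	/* first member, so release() can cast back */
	int fd;
	bool dev_registered;	/* models the device's view of this tap */
	int dev_dereg_calls;	/* how many times the device dereg ran */
};

#define NR_FDS 16
static struct fd_tap *fd_taps[NR_FDS];	/* per-FD tap pointer, under FDT lock */

/* Placeholders for the FD table spinlock; this sketch is single-threaded. */
static void fdt_lock(void) { }
static void fdt_unlock(void) { }

static void kref_put(struct kref *kref)
{
	assert(kref->refcnt > 0);
	if (--kref->refcnt == 0)
		kref->release(kref);
}

/* Release method: the only place that deregisters from the device.  In a
 * kernel this could block, and it would also free the tap. */
static void tap_release(struct kref *kref)
{
	struct fd_tap *tap = (struct fd_tap *)kref;

	tap->dev_registered = false;
	tap->dev_dereg_calls++;
}

/* Register, step 1: publish the tap in the FD table.  Two refs: one for the
 * registering thread, one for the table's pointer. */
static int tap_attach(struct fd_tap *tap, int fd)
{
	if (fd < 0 || fd >= NR_FDS)
		return -1;
	fdt_lock();
	if (fd_taps[fd]) {
		fdt_unlock();
		return -1;	/* only one tap per FD */
	}
	tap->kref.refcnt = 2;
	tap->kref.release = tap_release;
	tap->fd = fd;
	fd_taps[fd] = tap;
	fdt_unlock();
	return 0;
}

/* Register, step 2: call into the device with no lock held, then drop the
 * registering thread's ref. */
static void tap_register_with_dev(struct fd_tap *tap)
{
	tap->dev_registered = true;	/* stand-in for the device op */
	kref_put(&tap->kref);
}

/* close() path: lock the FDT, remove the tap, unlock, decref. */
static void tap_detach(int fd)
{
	struct fd_tap *tap;

	if (fd < 0 || fd >= NR_FDS)
		return;
	fdt_lock();
	tap = fd_taps[fd];
	fd_taps[fd] = NULL;
	fdt_unlock();
	if (tap)
		kref_put(&tap->kref);	/* outside the lock: may deregister */
}
```

If tap_detach(5) runs between tap_attach(&tap, 5) and tap_register_with_dev(&tap), the detach only drops the table's ref. The device deregister then runs once, from the registering thread's final kref_put, and only after the device register has happened.
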
| Note that as soon as we unlock, our tap could be decref'd and a completely new |
| tap could be added and registered for that FD. That means the following can |
| happen: |
|     lock FDT |
|     add tap 0xf00 to FD 5 |
|     unlock FDT |
|                                             lock FDT |
|                                             remove tap from FD 5 |
|                                             unlock FDT |
|                                             decref 0xf00 |
|                                             (new syscall) |
|                                             lock FDT |
|                                             add tap 0xbar to FD 5 |
|                                             unlock FDT |
|                                             register tap 0xbar for FD 5 |
|     register tap 0xf00 for FD 5 |
|     decref and trigger a deregister of 0xf00 |
| |
| In this case the device could see two separate taps (0xf00 and 0xbar) for the |
| same FD (5). It just so happens that one of them will deregister soon. It is |
| also possible for an event to fire between the left column's register and |
| decref, at which point two events would be created (possibly with the same evq |
| and event id). |
| |
| The final case to consider is when registration fails. To keep things simple |
| for the device, we can make sure that we only deregister a tap if our register |
| succeeded. To do this nicely with krefs, we can simply change the release |
| method, based on whether or not registration succeeds. |
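
One possible way to switch the release method is sketched below in user-space C. Everything here is an assumption for illustration: the names tap_min_release, tap_full_release, and tap_finish_register, the struct layouts, and the use of a flag in place of actually freeing memory. The registering thread still holds its own reference when it picks the method, so the release cannot run before the choice is made.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space stand-ins; not the Akaros definitions. */
struct kref {
	int refcnt;
	void (*release)(struct kref *kref);
};

struct fd_tap {
	struct kref kref;	/* first member, so release() can cast back */
	int dev_dereg_calls;	/* times the device saw a deregister */
	bool freed;		/* stands in for freeing the tap's memory */
};

static void kref_put(struct kref *kref)
{
	assert(kref->refcnt > 0);
	if (--kref->refcnt == 0)
		kref->release(kref);
}

/* Minimal release: the device never accepted the tap, so only free it. */
static void tap_min_release(struct kref *kref)
{
	struct fd_tap *tap = (struct fd_tap *)kref;

	tap->freed = true;
}

/* Full release: deregister from the device, then free. */
static void tap_full_release(struct kref *kref)
{
	struct fd_tap *tap = (struct fd_tap *)kref;

	tap->dev_dereg_calls++;
	tap_min_release(kref);
}

/* Tail of a hypothetical register path.  dev_ret is the device op's result
 * (0 on success).  The caller still holds its own ref, so release cannot run
 * before the method is chosen here. */
static int tap_finish_register(struct fd_tap *tap, int dev_ret)
{
	tap->kref.release = dev_ret ? tap_min_release : tap_full_release;
	kref_put(&tap->kref);
	return dev_ret;
}
```

A real failure path would presumably also need to pull the tap back out of the FD table, if it is still there, and drop the table's reference. That step is left out of the sketch.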