 FD Taps
 ===========================
 2015-07-27 Barret Rhoden (brho)

 Contents
 ---------------------------
 What are FD Taps?
 Where are the FD Taps?


 What are FD Taps?
 ---------------------------

 Where are the FD Taps?
 ---------------------------
 ### Basics ###
 In Linux, the epoll blob is attached to the File (I think, this is the struct
 eventpoll).  Linux can get from a sock -> socket -> file -> eventpoll.  From the
 lower levels of the networking stack, you can get all the way to the accounting
 info for epoll.

 In Akaros, and in Plan 9, the analogous object to the file is the chan.
 However, in the networking stack, the conversation (like a struct sock) does not
 keep a pointer to its chan.  Further, there is not a 1:1 correspondence between
 convs and chans: there could be several chans using the same conv, similar to
 using several OS files for the same underlying disk file (inode).  Although that
 might be a bad idea for network connections, it'd be nice to not have FD Taps
 assume anything about the underlying device.  So for Akaros, we want to have the
 tap somewhere within the device.  For #I, that probably means hanging off the
 conversation.  For #M (devmnt), it would be some other struct, where the tap is
 translated into a 9p message.

 Another aspect of this issue is that these are "FD" taps, not "file/chan" taps.
 If you read through the Q&A for epoll's man page, there are a bunch of weird
 conditions that result from having the tap on the file.  This is due to having
 multiple FDs point to the same file.

 The approach I took in Akaros was to have the tap in both the FD and within the
 device (the conversation).  If we're declaring interest in an FD, the FD is a
 reasonable place to track that interest.  We also need to track the tap within
 the device, as mentioned above.  Now we need to sort out the registration of
 taps and avoid any concurrency issues.

 ### Code Issues ###
 We need to worry about a few things.  Overall, we want to register a tap on an
 FD (struct file_desc), and that registration needs to go through the device.
 Perhaps the device doesn't support taps, or it doesn't support the event filters
 we requested.  So we need to handle registration failure.  We also need to
 handle concurrent deregistrations, re-registrations, opens, and closes.

 A basic approach would be to lock the FD table, make sure there's only one tap,
 register the new one with the device, insert into the table, and unlock.  The
 lock protects adding the tap (there can only be one, so adders race on the FD's
 tap pointer), guards against concurrent tap removals, enforces that the FD
 points to a file, and protects against FD closes.

 But the problem is the FD table lock is a spinlock, and we don't want it to be
 more than that.  Device registration could be a blocking call.  So we need to
 come up with something else.  Part of the problem involves syncing with two
 places: the FD and the conv.

 At this point I thought about putting the tap in the device, and not the FD at
 all.  Deregistration becomes tricky.  We want to destroy the tap when the FD
 closes, or at least turn it off.  Say we do something like "after closing,
 deregister the tap".  We could pass enough info to the device to make it
 work - we'd probably want to pass in the FD (integer), proc*, and probably the
 chan.  However, once we closed, the FD is now free, and we could have something
 like:
 	Trying to close:	User opens and taps a conv:
 	close(5) (FD 5 was 1/data with a tap)
 				open(/net/tcp/1/data) (get 5 back)
 				register_fd_tap(5) (two taps on 5, might fail!)
 	deregister_fd_tap(5)
 	cclose (needed to keep the chan alive)
 At the end, we might have no taps on 5.  Or if we opened 2/data instead of
 1/data, the deregister_fd_tap call will accidentally deregister from the new FD
 5 instead of the old one, and the old one will still be active!
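
 To make the FD reuse problem concrete, here is a minimal user-space sketch of
 the 2/data case above.  Everything in it (struct toy_proc, the toy_* helpers,
 the table sizes) is made up for illustration and is not Akaros code.  The
 only point is that tap ops keyed by the FD integer cannot tell the old FD 5
 from the new one.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_FDS 8
#define TOY_NR_CONVS 4

/* Toy state: which conv each FD points to, and which convs hold a tap. */
struct toy_proc {
	int fd_to_conv[TOY_NR_FDS];	/* -1 means the FD is free */
	bool conv_tapped[TOY_NR_CONVS];
};

static void toy_proc_init(struct toy_proc *p)
{
	for (int fd = 0; fd < TOY_NR_FDS; fd++)
		p->fd_to_conv[fd] = -1;
	for (int conv = 0; conv < TOY_NR_CONVS; conv++)
		p->conv_tapped[conv] = false;
}

/* Hands out the lowest free FD, pointing at conv; -1 if the table is full. */
static int toy_open(struct toy_proc *p, int conv)
{
	assert(conv >= 0 && conv < TOY_NR_CONVS);
	for (int fd = 0; fd < TOY_NR_FDS; fd++) {
		if (p->fd_to_conv[fd] == -1) {
			p->fd_to_conv[fd] = conv;
			return fd;
		}
	}
	return -1;
}

/* Frees the FD number right away; the tap is NOT touched here. */
static void toy_close(struct toy_proc *p, int fd)
{
	p->fd_to_conv[fd] = -1;
}

/* Both tap ops find the conv through the FD number and nothing else. */
static void toy_register_fd_tap(struct toy_proc *p, int fd)
{
	int conv = p->fd_to_conv[fd];

	if (conv != -1)
		p->conv_tapped[conv] = true;
}

static void toy_deregister_fd_tap(struct toy_proc *p, int fd)
{
	int conv = p->fd_to_conv[fd];

	if (conv != -1)
		p->conv_tapped[conv] = false;
}
```

 Replaying the timeline (open and tap conv 1, close, reopen as conv 2 and tap
 it, then run the closer's late deregister) leaves conv 1 tapped and conv 2
 untapped, which is exactly backwards.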

 Maybe we deregister first, then close, to avoid FD reuse problems.  Remember
 that the only locking goes on in close.  Now consider:
 	Trying to close:	User tries to add (another) tap:
 	deregister_fd_tap(5)
 				register_fd_tap(5)
 	close(5) (was 1/data with a tap)
 Now we just closed with a tap still registered.  Eventually, that FD tap might
 fire.  Spurious events are okay, but we could run into issues.  Say the evq in
 the original tap is no longer valid.  It was buggy for the user to perform this
 operation, but there are probably other issues.  And we didn't even get into
 how registration works (register before putting it in the FD table?  After?
 What about concurrent ops?)

 We could flag the FD as 'untappable'.  But it seems that we're going to need to
 sync with the FD table regardless of where the tap exists.  We might as well go
 back to the original plan of having the tap hang off the FD in some manner.  It
 makes the most sense, aesthetically, since the FD tap is an attribute of the FD.

 One trick that would help with FD reuse is to have the device op for
 register/deregister take the fd_tap pointer.  Not only can we squeeze more info
 in the tap without mucking with the function signature, but the main benefit is
 that so long as the FD tap is allocated, it is unique.  FD = 5 can be reused.
 FD_tap = 0xffff800012345678 is unique.
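
 A small sketch of why the pointer helps, assuming a toy device that simply
 remembers tap pointers.  The struct fd_tap fields and the toy_dev_* names are
 assumptions for this example, not the real device op signature.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_TAPS 8

/* Hypothetical tap layout; the field names are assumptions for this sketch. */
struct fd_tap {
	int fd;		/* FD number: may be reused once the FD is closed */
	int filter;	/* event filter bits requested by the user */
};

/* Toy device that tracks registered taps by pointer, not by FD number. */
struct toy_dev {
	struct fd_tap *taps[TOY_MAX_TAPS];
};

/* Returns true if the tap was stored, false if the device is full. */
static bool toy_dev_register(struct toy_dev *dev, struct fd_tap *tap)
{
	assert(tap != NULL);
	for (size_t i = 0; i < TOY_MAX_TAPS; i++) {
		if (dev->taps[i] == NULL) {
			dev->taps[i] = tap;
			return true;
		}
	}
	return false;
}

/* Removes exactly the given tap; returns false if it was never registered. */
static bool toy_dev_deregister(struct toy_dev *dev, struct fd_tap *tap)
{
	for (size_t i = 0; i < TOY_MAX_TAPS; i++) {
		if (dev->taps[i] == tap) {
			dev->taps[i] = NULL;
			return true;
		}
	}
	return false;
}

/* Counts the taps the device currently holds for a given FD number. */
static int toy_dev_count_fd(const struct toy_dev *dev, int fd)
{
	int count = 0;

	for (size_t i = 0; i < TOY_MAX_TAPS; i++) {
		if (dev->taps[i] != NULL && dev->taps[i]->fd == fd)
			count++;
	}
	return count;
}
```

 With two distinct taps that both carry fd = 5, deregistering the old pointer
 removes only the old tap.  The new tap on the reused FD 5 is left alone, and
 a second deregister of the old pointer is reported as spurious instead of
 hitting someone else's tap.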

 However, simply adding the tap pointer to register() isn't enough.  Say we did
 the basic "lock the FD table, (basic checks), attach the pointer, unlock, call
 device register, then free it if register fails", and a dereg locks the table,
 yanks it out, then calls device dereg, then frees.  We still have some issues:

 - What if a deregister occurs while we are still trying to register, and the
   register then fails?  Who actually frees the FD tap?  We can't completely
   free it while the other op is in progress.  That sounds like a job for a kref
   on the FD tap.

 - What if we add the tap, go to register, and it fails, while a concurrent
   close tries to deregister it?  Now we have concurrent deregisters.
   We can deal with this by having the device op accept spurious deregisters, but
   that's ugly (and unnecessary, see below).

 - What if a legit deregister occurs while we are registering, and the register
   will eventually succeed?  Say:
 						sys_register_fd_tap(0xf00)
 						adds to fdset, unlocks
 	close(5)
 	yanks 0xf00 from the fd
 	deregister tap 0xf00 (fails, spurious)
 						register tap(chan, 0xf00)
 	free 0xf00?
 The deregister fails, since it was never there (remember we said it could have
 spurious deregister calls).  Then register happens.  But the FD is closed!  And
 then who is freeing the tap?  Hopefully we don't free it while the device still
 has a pointer...

 The issue here is the assumption that the tap would have been registered.  Since
 we unlock the FD table, we can violate those assumptions.  We want to guarantee
 the order of register/deregister operations, such that register happens before
 deregister.

 It turns out that the kref can do this too!  The trick is to use the release
 operation to do the deregistration.  That ensures that so long as a reference is
 held, we won't call deregister *and* that deregister will happen exactly once.
 close() simply becomes "lock the FDT, remove the tap, unlock, decref": extremely
 simple.  Note that decref could trigger the release method which could then
 sleep (since it calls into a device), so we decref outside the lock.  register()
 ups the refcnt by two: one for itself, to keep the tap alive (and prevent a
 concurrent dereg), and one for the pointer in the FD table.
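
 The scheme can be sketched in user space roughly like this.  struct toy_ref
 is a stand-in for a kref (it is not the Akaros kref API), the FD table is
 shrunk to a single tap slot, and the device is reduced to a "registered"
 flag.  I also assume register() takes both references by starting the count
 at two.  All names here are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* User-space stand-in for a kref; not the Akaros implementation. */
struct toy_ref {
	atomic_int count;
	void (*release)(struct toy_ref *ref);
};

struct fd_tap {
	struct toy_ref ref;
	int registered;		/* 1 while the toy device holds the tap */
	int dereg_calls;	/* how many times deregister ran */
};

/* Toy FD table: a single tap slot guarded by a spinlock. */
struct toy_fdt {
	atomic_flag lock;
	struct fd_tap *tap;
};

static void toy_fdt_init(struct toy_fdt *fdt)
{
	atomic_flag_clear(&fdt->lock);
	fdt->tap = NULL;
}

static void toy_ref_put(struct toy_ref *ref)
{
	/* fetch_sub returns the old value: 1 means we dropped the last ref. */
	int old = atomic_fetch_sub(&ref->count, 1);

	assert(old > 0);
	if (old == 1)
		ref->release(ref);
}

/* Release method: the only place that deregisters from the device. */
static void tap_release(struct toy_ref *ref)
{
	struct fd_tap *tap =
		(struct fd_tap *)((char *)ref - offsetof(struct fd_tap, ref));

	assert(tap->registered);	/* register's ref kept us from running early */
	tap->registered = 0;
	tap->dereg_calls++;
}

/* First half of register(): take two refs, then publish in the FD table. */
static void tap_register_begin(struct toy_fdt *fdt, struct fd_tap *tap)
{
	tap->registered = 0;
	tap->dereg_calls = 0;
	tap->ref.release = tap_release;
	atomic_init(&tap->ref.count, 2);	/* one for us, one for the table */
	while (atomic_flag_test_and_set(&fdt->lock))
		;	/* spin */
	fdt->tap = tap;
	atomic_flag_clear(&fdt->lock);
}

/* Second half: the (possibly blocking) device call, then drop our ref. */
static void tap_register_finish(struct fd_tap *tap)
{
	tap->registered = 1;	/* stands in for the device's register op */
	toy_ref_put(&tap->ref);
}

/* close(): lock the FDT, remove the tap, unlock, decref outside the lock. */
static void tap_close(struct toy_fdt *fdt)
{
	struct fd_tap *tap;

	while (atomic_flag_test_and_set(&fdt->lock))
		;	/* spin */
	tap = fdt->tap;
	fdt->tap = NULL;
	atomic_flag_clear(&fdt->lock);
	if (tap)
		toy_ref_put(&tap->ref);
}
```

 If tap_close() runs between tap_register_begin() and tap_register_finish(),
 its decref only drops the count to one, so nothing is deregistered yet.  The
 deregister runs once, from the put at the end of tap_register_finish(), after
 the device has actually seen the tap.  A real release method would also free
 the tap; the sketch skips that so the caller can inspect the counters.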

 Note that as soon as we unlock, our tap could be decref'd and a completely new
 tap could be added and registered for that FD.  That means the following can
 happen:
 	lock FDT
 	add tap 0xf00 to FD 5
 	unlock FDT
 						lock FDT
 						remove tap from FD 5
 						unlock FDT
 						decref 0xf00
 						(new syscall)
 						lock FDT
 						add tap 0xbar to FD 5
 						unlock FDT
 						register tap 0xbar for FD 5
 	register tap 0xf00 for FD 5
 	decref and trigger a deregister of 0xf00

 In this case the device could see two separate taps (0xf00 and 0xbar) for the
 same FD (5).  It just so happens that one of them will deregister soon.  It is
 also possible for an event to fire between the left column's register and
 decref, at which point two events would be created (possibly with the same evq
 and event id).

 The final case to consider is when registration fails.  To keep things simple
 for the device, we can make sure that we only deregister a tap if our register
 succeeded.  To do this nicely with krefs, we can simply change the release
 method, based on whether or not registration succeeds.
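
 Roughly, and again with a user-space stand-in for the kref, that could look
 like the sketch below.  The names tap_release_minimal and tap_release_full,
 and the dev_accepts flag standing in for the device op's result, are
 assumptions for the example.  How the FD table's reference gets dropped after
 a failed register is not covered above, so the sketch leaves that put to the
 caller.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* User-space stand-in for a kref; not the Akaros implementation. */
struct toy_ref {
	atomic_int count;
	void (*release)(struct toy_ref *ref);
};

struct fd_tap {
	struct toy_ref ref;
	int dereg_calls;	/* times the device saw a deregister */
	int released;		/* times any release method ran */
};

static struct fd_tap *ref_to_tap(struct toy_ref *ref)
{
	return (struct fd_tap *)((char *)ref - offsetof(struct fd_tap, ref));
}

static void toy_ref_put(struct toy_ref *ref)
{
	/* fetch_sub returns the old value: 1 means we dropped the last ref. */
	int old = atomic_fetch_sub(&ref->count, 1);

	assert(old > 0);
	if (old == 1)
		ref->release(ref);
}

/* Used when registration failed: the device never saw the tap. */
static void tap_release_minimal(struct toy_ref *ref)
{
	ref_to_tap(ref)->released++;
}

/* Used when registration succeeded: deregister, then clean up. */
static void tap_release_full(struct toy_ref *ref)
{
	struct fd_tap *tap = ref_to_tap(ref);

	tap->dereg_calls++;	/* stands in for the device's deregister op */
	tap->released++;
}

/*
 * Toy register(): start with the minimal release and upgrade it only if the
 * device accepted the tap.  Swapping the pointer is safe here because we
 * still hold our own ref, so no release can run until our put below.
 */
static bool tap_register(struct fd_tap *tap, bool dev_accepts)
{
	tap->dereg_calls = 0;
	tap->released = 0;
	tap->ref.release = tap_release_minimal;
	atomic_init(&tap->ref.count, 2);	/* ours + the FD table's */
	if (dev_accepts)
		tap->ref.release = tap_release_full;
	toy_ref_put(&tap->ref);			/* drop our own ref */
	return dev_accepts;
}
```

 Once the last reference goes away, a tap whose register succeeded sees
 exactly one deregister, and a tap whose register failed sees none.  Either
 way the device never has to cope with a deregister for a tap it never
 accepted.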