Documentation/kref.txt - akaros - Git at Google

 Krefs and Topics on Refcounting:
 ----------------------------
 We do reference counting of our objects in much the same way as Linux.  The
 stuff we do is based on the Linux way, discussed/described here:
 - http://www.kroah.com/linux/talks/ols_2004_kref_paper/
          Reprint-Kroah-Hartman-OLS2004.pdf
 - http://lwn.net/Articles/336224/
 - Linux's Documentation/kref.txt

 To best understand krefs, keep in mind this quote (from their kref.txt): "the
 kref is appropriate when the object doesn't outlive its last external
 reference."  What we want is to know when all "external" references to an object
 are gone.  At which point we can do something (cleanup, free, etc) and be sure
 that no other code can get a reference.  We want to do this without locks, if
 possible, so we make use of a variety of helpful atomics (though RAMP uses a
 lock-per-atomic).

 This document will ramble about the details and other stuff.
 Here are things to keep in mind, which differ in some of the use cases:
 - we want to do something when the refcount hits 0.
 - we don't always know when that is.
 - we might not have a function to call to explicitly remove the object.
 - we might want to bring an object "back to life."  This is outliving your last
   external reference, but it is something we want.  And so does Linux.

 kref rules (adaptation of the linux ones):
 1. If you pass a pointer somewhere or store it (a non-temporary copy),
 kref_get() it first.  You can do this with no locking if you have a valid
 external reference.
 2. When you are done, kref_put() it.
 3. If you do not have a valid *external* reference, you can try
 kref_get_not_zero().  You must have an internal reference for this.  And to
 protect that *internal* reference, you often need to lock or otherwise sync
 around the source of that internal reference (usually a list-like structure
 (LLS)), unless that ref is otherwise protected from concurrent freeing.
 4. If you plan to resurrect an object (make its refcnt go from 0 to 1), you
 need to use some other form of sync between the final freeing and the
 resurrection.

 We differ from Linux in a few of ways.  We have a kref_get_not_zero(), which
 they talk about in some of their documents, to be a bit more clear about getting
 refs when objects might be at 0.  Some of the differences is just semantics and
 usage.  We also allow incrementing by more than one, which helps in some cases.
 We don't allow decrementing by more than one to catch bugs (for now).  We also
 put the release function in the kref, which is what Linux used to do.  We do it
 to cut down on the number of places the function pointer is writen, since I
 expect those to change a lot as subsystems use krefs.  Finally, we only use
 krefs to trigger an object's release function, which might not free them
 forever.  They can be "resurrected" back to having an external reference.  You
 can do similar things in Linux, but it's not clear (to me) how separate that
 idea is from a kref.

 External (strong) vs internal (weak) references:
 ----------------------------
 I want something to happen (like for the obj to die) when its refcnt hits 0.  I
 don't know when this will happen, or who will be the one who makes the final
 decref.  So we have the kref - it automatically executes a callback when the
 refcnt hits 0.  However, there are cases when we have "references" (possibly
 uncounted pointers) to an object that we do not want to consider in this
 calculation.  A good example is the superblock's list of all files.  If that
 reference is counted, the file will never be freed - it's just a list for
 managing the existing files.  There's a chicken-egg issue, since removing the
 pointer to the file from the list is what should happen in the cleanup after the
 refcount hits 0.  This is the nature of an internal (weak) reference.  External
 (strong) references are those that count towards when an object should be
 freed/called-back/whatever.

 So now we have lists and other sources of a "reference" / pointer (and it
 doesn't matter if is counted in some variable other than refcount).  We need to
 be able to get valid external references for this.  Just about any time that you
 want to use the object, you'll need to do this - or otherwise prevent it from
 dying/being clean-up.  We can't simply kref_get(), since we don't already have a
 valid external reference.  Instead we use kref_get_not_zero(), which uses
 atomic_inc_not_zero() (or something similar).  Using this guarantees that if it
 succeeded, there was *an* external reference - we just didn't have it.  It is
 possible that this fails, and it's a bit slower, so only use it when going from
 an internal reference to an external one.  For lists like the SB's file
 list/tailq, this won't be too hard, since once we hit 0, we remove it from the
 list and free it.  Things are more complicated if we want to do other things
 when we hit 0 - specifically if we want to "come back from the dead" and go from
 0 back to 1 or more.

 However, things aren't quite this easy.  Short version: lock your list.  Long
 version: kref_get_not_zero() only works when we have a valid internal reference.
 It works based on the assumption that its pointer is still the object (and not
 some freed memory).  You need to do something like: "lock the LLS, get your
 pointer from the LLS, get_not_zero, unlock."  The lock protects your internal
 reference.  Later on we talk about some other schemes, but ultimately you need
 to sync or otherwise guarantee your internal reference is valid.

 This is probably why we want to either lock outside our list-solution, or
 integrate the list/hash structure into the owning subsystem.  (mentioned in more
 detail below).  Also note, with locking you need to be careful of deadlocking.
 kref_put() will return 1 if the refcnt hit 0, so you can unlock and do some
 other cleanup without holding a list's lock.  This effectively allows you to
 split the release between what kref_put() does automatically with whatever else
 you want done.

 What about an 'internal' kref?  If we refcnt the refs on lists, we still have
 the same issue of syncing between a list reader and a list writer.  The reader
 needs to atomically read and kref_get (not_zero or otherwise).  Otherwise,
 someone can remove, put, free, and do whatever to the item between the read and
 the kref_get (in theory).

 The reason for this is the same reason we have trouble with lists and internal
 references in the first place: both the list reader and the writer are sharing
 the same reference.  One is trying to kref_get() it, while another is trying to
 kref_put() it.  If the freeing happens in between getting the pointer from the
 list and the kref_getting, we'll have a bug.  (unless we do some tricks related
 to seq counters and grace periods, discussed below).  This is the same issue,
 regardless of whether the internal reference is using another kref or if it uses
 some other scheme (like lock the object and check its state).

 Resurrecting an object after it hit 0
 ----------------------------
 So when we hit 0, we might not be completely done with the object.  This is part
 of the "kcref" (c == cached) design pattern the Linux guys talk about.  The main
 example they use (and we will too) is the dentry - you want to cache them, even
 if their refcnt is 0.  Once their refcount is 0, you still want to do things -
 just not free them.  Specifically, we'll set their state to show they are unused
 (and thus freeable when there is pressure).  The tricky thing is that when a
 cache lookup finds the dentry, whose refcount is still 0, we need to be able to
 get it and set its external refcount to 1 (bring it "back to life").  We can use
 kref_init() or some other function for this - the important thing is that it
 cannot happen while the object is being freed for real (when it is removed from
 the cache).  One way or another, we need to synchronize.

 The way Linux handles this is that there is a lock for the dentry object and one
 overall one for the dentry cache.  You can't convert an internal reference to an
 external reference without hold this lock.  Just locking on the object could
 result in issues where an object is freed and reused as a *different* object,
 which then the cache reader gets instead of failing at trying to get the
 original object.  The kref refcount only helps when the refcount goes from 1 to
 0, and on triggering the followup/release action.

 To resurrect, we can't just do:
 	if (!kref_refcnt(&dentry->d_kref))
 		kref_init(&dentry->d_kref, dentry_release, 1);
 	else
 		kref_get(&dentry->d_kref, 1);

 There is a race on reading the refcnt and mucking with it.  If it is
 concurrently going from 1 -> 0, and we read 0, it is okay.  We still up it to 1.
 However, if we go from 1 -> 0 and read 1, we'll panic when we try to kref_get a
 0'd kref.  Also, doing this would be dangerous going from 0 -> 1 if other code
 would resurrect (which it does not!).  The solution is to use a kref_get that
 doesn't care about 0 (__kref_get()).

 Elsewhere in this documentation, we talk about kref_get_not_zero().  That one
 will try and fail gracefully (used by pid2proc()).  kref_get() will fail on
 zero.  __kref_get() will not fail on zero and will blindly increment, which is
 what we want.

 Trickiness with lockless data structures:
 ----------------------------
 Ideally, we want concurrent access to the dentry cache (or whatever cache has
 krefd objects that we want to resurrect).  Perhaps this is with CAS on linked
 lists, or locks per hash bucket.  Since we need to prevent the resurrection of
 objects from proceeding while another thread could be trying to remove them, we
 need some sync between readers and writers.  Both threads in the scenario we've
 been talking about are going to be writers for the object, though from the
 perspective of the list-like-structure, only one is a writer - the one who makes
 it such that we cannot create an external ref from an internal one, because the
 object is dying.

 Simply doing this: "lookup, object lock, muck with state" will result in bugs,
 since the state change changes the lookup, but the locking happens after the
 lookup.  We could be manipulating a newly slab-allocated object (or anything,
 really), that has an unlocked byte where we expect it (and thus our spin_lock()
 succeeded).

 At this point, there are two options: Handle it in the list-like-structure, or
 play more clever games:

 1: Push it into the list-like structure (LLS)
 ----------------
 In the first case, the state changing needs to be pushed into the data structure
 managing the objects for these types of scenarios.  For example, a
 spinlock-per-hashbucket scenario would need to know to execute a certain
 function on the object if it finds it before unlocking the bucket.  Sync systems
 based on RCU will need to be careful with the final delete function.  I think
 this is part of the reason why Linux doesn't have a built-in solution for a hash
 table - subsystems roll their own using tools (instead of solutions), since the
 removal is coupled with how the subsystem works.

 2: Seq-ctrs and Reused objects
 ----------------
 The other way, and one we may look into, is to push a seq-ctr into the object.
 The plan is then: "{lookup, object lock} (while seq changed), muck with state"
 (being careful to unlock, etc).  The idea is that the seq-ctr gets incremented
 when the object is freed / given back to the slab allocator (which is done by
 the removal thread).  If the seq-ctr changed, that means we are no longer
 looking at the same object, and we need to retry the lookup (start over).  The
 seq-ctr exists for the memory region dedicated to the object, not for a specific
 object.

 This approach needs a little more help: it assumes that any freed object will
 continue to be an object of that type for a little while.  We are dealing with
 slab-allocated objects, so that is a bit more true than when dealing with
 kmalloc().  However, it is possible that there was memory pressure and the slab
 subsystem freed those pages.  We need to be sure that this event has not
 happened before we try to lock the object.  It's okay to lock and unlock an
 'unrelated' dentry.  It's not okay to try and lock a random spot in memory.  We
 could have a variety of read-mostly globals to perhaps signal slab-pressure,
 dentry-slab-pressure, or something like that, depending on who else needs to
 know about that event.  These seem tricky, and probably require some form of
 locking, which leads to the most likely candidate: RCU-style grace periods.

 The thing we really care about is that the pages are not allocated to some other
 use.  For unrelated (TLB) reasons, we may build an RCU-like mechanism for
 dealing with free pages, and we could consider using this - though if the slab
 allocator is asked for memory, it is because we are running out, and we don't
 want to wait for grace periods.  Anyway, we'll probably never do any of this, at
 least until it becomse a problem.

 FYI: the whole lock on possibly the next incarnation of the object will only
 work if we unlock it at some point - might be a while, if it doesn't get
 unlocked until it's spinlock_init()ed as the new object.  If we do this, we
 ought to unlock before slab-freeing.

 Final Notes
 ----------------------------
 There is a difference between removing references to you from all over the place
 and cleaning up references that you know about (like when you know you are in a
 TAILQ all the time vs. being in a process's file list).

 Process management is a bit different, since it does not want to destroy or free
 you until there was some explicit action (calling proc_destroy()).  We still use
 krefs, since we don't know who is the "last one out" to do the freeing, so we
 layer proc_destroy() on top of kref/__proc_free().  This is why we have the "one
 ref to keep the object alive."  For a little while, this ref was stored in the
 pid2proc hash, but that is now holds an internal reference (we have the tech, it
 keeps things in sync with other usage models, and it makes proc_destroy and
 sys_trywait easier).  Note the runnable_list has external references, in part
 because it is a different subsystem (scheduler).

 Remember: the reason for why we have trouble with lists and (internal)
 references: both the list reader and the writer are sharing the same
 reference/pointer.  We need to either make sure that a reader can use the
 pointer (pinned, kref_get(), a flag, whatever while preventing access to the
 pointer from another thread (lock the list)), or that it can *safely* find out
 it needs to try again.  "Safe" means not accessing freed memory - hence the
 grace periods on slab-freed pages.

 Kreffing works because we have a known, synchronous initialization point where
 we kref_init() the refcount to 1.  We can't do that easily when resurrecting or
 even kref_get_not_zero() because another thread may be trying to permanently
 free the object.

 The difference between kref_get_not_zero() and resurrection is that
 get_not_zero() is trying to get an external reference and the object's release
 method has not been called (or it has, and we should fail since our source is a
 weak/internal reference source).  Resurrection is when we've already hit 0,
 release()d, and now want to reuse that object.  For concrete examples:
 get_not_zero() works when getting a file off the superblock file list - it only
 should work if the file is still in the system and not about to be removed from
 the list.  Resurrection is for objects that don't get freed when they lose
 their external references, such as dentries (they get put on the dentry cache).
 On a cache hit for an unreferenced dentry, we'll need to change its state and
 reinit its refcount.  Next time it is kref_put()d to 0, it'll rerun its
 release() method.

 Deadlocking is still an issue, and might be what forces us to more lock-free
 lists.  Here's an example of what not to do: grab the SB's tailq lock, for each
 file, kref_put it, then unlock.  That code might be trying to force all files to
 close, which is just fundamentally flawed anyways (since you're decreffing an
 internal reference, instead of an external one).  But you'll deadlock too, since
 the file's kref release() method will try and grab the same lock to rip the item
 off the list.  So this is a contrived example, but it goes to show that you need
 to be careful about where you are locking and what functions your kref_put()
 might trigger.
	Krefs and Topics on Refcounting:
	----------------------------
	We do reference counting of our objects in much the same way as Linux. The
	stuff we do is based on the Linux way, discussed/described here:
	- http://www.kroah.com/linux/talks/ols_2004_kref_paper/
	Reprint-Kroah-Hartman-OLS2004.pdf
	- http://lwn.net/Articles/336224/
	- Linux's Documentation/kref.txt

	To best understand krefs, keep in mind this quote (from their kref.txt): "the
	kref is appropriate when the object doesn't outlive its last external
	reference." What we want is to know when all "external" references to an object
	are gone. At which point we can do something (cleanup, free, etc) and be sure
	that no other code can get a reference. We want to do this without locks, if
	possible, so we make use of a variety of helpful atomics (though RAMP uses a
	lock-per-atomic).

	This document will ramble about the details and other stuff.
	Here are things to keep in mind, which differ in some of the use cases:
	- we want to do something when the refcount hits 0.
	- we don't always know when that is.
	- we might not have a function to call to explicitly remove the object.
	- we might want to bring an object "back to life." This is outliving your last
	external reference, but it is something we want. And so does Linux.

	kref rules (adaptation of the linux ones):
	1. If you pass a pointer somewhere or store it (a non-temporary copy),
	kref_get() it first. You can do this with no locking if you have a valid
	external reference.
	2. When you are done, kref_put() it.
	3. If you do not have a valid external reference, you can try
	kref_get_not_zero(). You must have an internal reference for this. And to
	protect that internal reference, you often need to lock or otherwise sync
	around the source of that internal reference (usually a list-like structure
	(LLS)), unless that ref is otherwise protected from concurrent freeing.
	4. If you plan to resurrect an object (make its refcnt go from 0 to 1), you
	need to use some other form of sync between the final freeing and the
	resurrection.

	We differ from Linux in a few of ways. We have a kref_get_not_zero(), which
	they talk about in some of their documents, to be a bit more clear about getting
	refs when objects might be at 0. Some of the differences is just semantics and
	usage. We also allow incrementing by more than one, which helps in some cases.
	We don't allow decrementing by more than one to catch bugs (for now). We also
	put the release function in the kref, which is what Linux used to do. We do it
	to cut down on the number of places the function pointer is writen, since I
	expect those to change a lot as subsystems use krefs. Finally, we only use
	krefs to trigger an object's release function, which might not free them
	forever. They can be "resurrected" back to having an external reference. You
	can do similar things in Linux, but it's not clear (to me) how separate that
	idea is from a kref.

	External (strong) vs internal (weak) references:
	----------------------------
	I want something to happen (like for the obj to die) when its refcnt hits 0. I
	don't know when this will happen, or who will be the one who makes the final
	decref. So we have the kref - it automatically executes a callback when the
	refcnt hits 0. However, there are cases when we have "references" (possibly
	uncounted pointers) to an object that we do not want to consider in this
	calculation. A good example is the superblock's list of all files. If that
	reference is counted, the file will never be freed - it's just a list for
	managing the existing files. There's a chicken-egg issue, since removing the
	pointer to the file from the list is what should happen in the cleanup after the
	refcount hits 0. This is the nature of an internal (weak) reference. External
	(strong) references are those that count towards when an object should be
	freed/called-back/whatever.

	So now we have lists and other sources of a "reference" / pointer (and it
	doesn't matter if is counted in some variable other than refcount). We need to
	be able to get valid external references for this. Just about any time that you
	want to use the object, you'll need to do this - or otherwise prevent it from
	dying/being clean-up. We can't simply kref_get(), since we don't already have a
	valid external reference. Instead we use kref_get_not_zero(), which uses
	atomic_inc_not_zero() (or something similar). Using this guarantees that if it
	succeeded, there was an external reference - we just didn't have it. It is
	possible that this fails, and it's a bit slower, so only use it when going from
	an internal reference to an external one. For lists like the SB's file
	list/tailq, this won't be too hard, since once we hit 0, we remove it from the
	list and free it. Things are more complicated if we want to do other things
	when we hit 0 - specifically if we want to "come back from the dead" and go from
	0 back to 1 or more.

	However, things aren't quite this easy. Short version: lock your list. Long
	version: kref_get_not_zero() only works when we have a valid internal reference.
	It works based on the assumption that its pointer is still the object (and not
	some freed memory). You need to do something like: "lock the LLS, get your
	pointer from the LLS, get_not_zero, unlock." The lock protects your internal
	reference. Later on we talk about some other schemes, but ultimately you need
	to sync or otherwise guarantee your internal reference is valid.

	This is probably why we want to either lock outside our list-solution, or
	integrate the list/hash structure into the owning subsystem. (mentioned in more
	detail below). Also note, with locking you need to be careful of deadlocking.
	kref_put() will return 1 if the refcnt hit 0, so you can unlock and do some
	other cleanup without holding a list's lock. This effectively allows you to
	split the release between what kref_put() does automatically with whatever else
	you want done.

	What about an 'internal' kref? If we refcnt the refs on lists, we still have
	the same issue of syncing between a list reader and a list writer. The reader
	needs to atomically read and kref_get (not_zero or otherwise). Otherwise,
	someone can remove, put, free, and do whatever to the item between the read and
	the kref_get (in theory).

	The reason for this is the same reason we have trouble with lists and internal
	references in the first place: both the list reader and the writer are sharing
	the same reference. One is trying to kref_get() it, while another is trying to
	kref_put() it. If the freeing happens in between getting the pointer from the
	list and the kref_getting, we'll have a bug. (unless we do some tricks related
	to seq counters and grace periods, discussed below). This is the same issue,
	regardless of whether the internal reference is using another kref or if it uses
	some other scheme (like lock the object and check its state).

	Resurrecting an object after it hit 0
	----------------------------
	So when we hit 0, we might not be completely done with the object. This is part
	of the "kcref" (c == cached) design pattern the Linux guys talk about. The main
	example they use (and we will too) is the dentry - you want to cache them, even
	if their refcnt is 0. Once their refcount is 0, you still want to do things -
	just not free them. Specifically, we'll set their state to show they are unused
	(and thus freeable when there is pressure). The tricky thing is that when a
	cache lookup finds the dentry, whose refcount is still 0, we need to be able to
	get it and set its external refcount to 1 (bring it "back to life"). We can use
	kref_init() or some other function for this - the important thing is that it
	cannot happen while the object is being freed for real (when it is removed from
	the cache). One way or another, we need to synchronize.

	The way Linux handles this is that there is a lock for the dentry object and one
	overall one for the dentry cache. You can't convert an internal reference to an
	external reference without hold this lock. Just locking on the object could
	result in issues where an object is freed and reused as a different object,
	which then the cache reader gets instead of failing at trying to get the
	original object. The kref refcount only helps when the refcount goes from 1 to
	0, and on triggering the followup/release action.

	To resurrect, we can't just do:
	if (!kref_refcnt(&dentry->d_kref))
	kref_init(&dentry->d_kref, dentry_release, 1);
	else
	kref_get(&dentry->d_kref, 1);

	There is a race on reading the refcnt and mucking with it. If it is
	concurrently going from 1 -> 0, and we read 0, it is okay. We still up it to 1.
	However, if we go from 1 -> 0 and read 1, we'll panic when we try to kref_get a
	0'd kref. Also, doing this would be dangerous going from 0 -> 1 if other code
	would resurrect (which it does not!). The solution is to use a kref_get that
	doesn't care about 0 (__kref_get()).

	Elsewhere in this documentation, we talk about kref_get_not_zero(). That one
	will try and fail gracefully (used by pid2proc()). kref_get() will fail on
	zero. __kref_get() will not fail on zero and will blindly increment, which is
	what we want.

	Trickiness with lockless data structures:
	----------------------------
	Ideally, we want concurrent access to the dentry cache (or whatever cache has
	krefd objects that we want to resurrect). Perhaps this is with CAS on linked
	lists, or locks per hash bucket. Since we need to prevent the resurrection of
	objects from proceeding while another thread could be trying to remove them, we
	need some sync between readers and writers. Both threads in the scenario we've
	been talking about are going to be writers for the object, though from the
	perspective of the list-like-structure, only one is a writer - the one who makes
	it such that we cannot create an external ref from an internal one, because the
	object is dying.

	Simply doing this: "lookup, object lock, muck with state" will result in bugs,
	since the state change changes the lookup, but the locking happens after the
	lookup. We could be manipulating a newly slab-allocated object (or anything,
	really), that has an unlocked byte where we expect it (and thus our spin_lock()
	succeeded).

	At this point, there are two options: Handle it in the list-like-structure, or
	play more clever games:

	1: Push it into the list-like structure (LLS)
	----------------
	In the first case, the state changing needs to be pushed into the data structure
	managing the objects for these types of scenarios. For example, a
	spinlock-per-hashbucket scenario would need to know to execute a certain
	function on the object if it finds it before unlocking the bucket. Sync systems
	based on RCU will need to be careful with the final delete function. I think
	this is part of the reason why Linux doesn't have a built-in solution for a hash
	table - subsystems roll their own using tools (instead of solutions), since the
	removal is coupled with how the subsystem works.

	2: Seq-ctrs and Reused objects
	----------------
	The other way, and one we may look into, is to push a seq-ctr into the object.
	The plan is then: "{lookup, object lock} (while seq changed), muck with state"
	(being careful to unlock, etc). The idea is that the seq-ctr gets incremented
	when the object is freed / given back to the slab allocator (which is done by
	the removal thread). If the seq-ctr changed, that means we are no longer
	looking at the same object, and we need to retry the lookup (start over). The
	seq-ctr exists for the memory region dedicated to the object, not for a specific
	object.

	This approach needs a little more help: it assumes that any freed object will
	continue to be an object of that type for a little while. We are dealing with
	slab-allocated objects, so that is a bit more true than when dealing with
	kmalloc(). However, it is possible that there was memory pressure and the slab
	subsystem freed those pages. We need to be sure that this event has not
	happened before we try to lock the object. It's okay to lock and unlock an
	'unrelated' dentry. It's not okay to try and lock a random spot in memory. We
	could have a variety of read-mostly globals to perhaps signal slab-pressure,
	dentry-slab-pressure, or something like that, depending on who else needs to
	know about that event. These seem tricky, and probably require some form of
	locking, which leads to the most likely candidate: RCU-style grace periods.

	The thing we really care about is that the pages are not allocated to some other
	use. For unrelated (TLB) reasons, we may build an RCU-like mechanism for
	dealing with free pages, and we could consider using this - though if the slab
	allocator is asked for memory, it is because we are running out, and we don't
	want to wait for grace periods. Anyway, we'll probably never do any of this, at
	least until it becomse a problem.

	FYI: the whole lock on possibly the next incarnation of the object will only
	work if we unlock it at some point - might be a while, if it doesn't get
	unlocked until it's spinlock_init()ed as the new object. If we do this, we
	ought to unlock before slab-freeing.

	Final Notes
	----------------------------
	There is a difference between removing references to you from all over the place
	and cleaning up references that you know about (like when you know you are in a
	TAILQ all the time vs. being in a process's file list).

	Process management is a bit different, since it does not want to destroy or free
	you until there was some explicit action (calling proc_destroy()). We still use
	krefs, since we don't know who is the "last one out" to do the freeing, so we
	layer proc_destroy() on top of kref/__proc_free(). This is why we have the "one
	ref to keep the object alive." For a little while, this ref was stored in the
	pid2proc hash, but that is now holds an internal reference (we have the tech, it
	keeps things in sync with other usage models, and it makes proc_destroy and
	sys_trywait easier). Note the runnable_list has external references, in part
	because it is a different subsystem (scheduler).

	Remember: the reason for why we have trouble with lists and (internal)
	references: both the list reader and the writer are sharing the same
	reference/pointer. We need to either make sure that a reader can use the
	pointer (pinned, kref_get(), a flag, whatever while preventing access to the
	pointer from another thread (lock the list)), or that it can safely find out
	it needs to try again. "Safe" means not accessing freed memory - hence the
	grace periods on slab-freed pages.

	Kreffing works because we have a known, synchronous initialization point where
	we kref_init() the refcount to 1. We can't do that easily when resurrecting or
	even kref_get_not_zero() because another thread may be trying to permanently
	free the object.

	The difference between kref_get_not_zero() and resurrection is that
	get_not_zero() is trying to get an external reference and the object's release
	method has not been called (or it has, and we should fail since our source is a
	weak/internal reference source). Resurrection is when we've already hit 0,
	release()d, and now want to reuse that object. For concrete examples:
	get_not_zero() works when getting a file off the superblock file list - it only
	should work if the file is still in the system and not about to be removed from
	the list. Resurrection is for objects that don't get freed when they lose
	their external references, such as dentries (they get put on the dentry cache).
	On a cache hit for an unreferenced dentry, we'll need to change its state and
	reinit its refcount. Next time it is kref_put()d to 0, it'll rerun its
	release() method.

	Deadlocking is still an issue, and might be what forces us to more lock-free
	lists. Here's an example of what not to do: grab the SB's tailq lock, for each
	file, kref_put it, then unlock. That code might be trying to force all files to
	close, which is just fundamentally flawed anyways (since you're decreffing an
	internal reference, instead of an external one). But you'll deadlock too, since
	the file's kref release() method will try and grab the same lock to rip the item
	off the list. So this is a contrived example, but it goes to show that you need
	to be careful about where you are locking and what functions your kref_put()
	might trigger.