| Krefs and Topics on Refcounting: |
| ---------------------------- |
| We do reference counting of our objects in much the same way as Linux. The |
| stuff we do is based on the Linux way, discussed/described here: |
| - http://www.kroah.com/linux/talks/ols_2004_kref_paper/ |
| Reprint-Kroah-Hartman-OLS2004.pdf |
| - http://lwn.net/Articles/336224/ |
| - Linux's Documentation/kref.txt |
| |
| To best understand krefs, keep in mind this quote (from their kref.txt): "the |
| kref is appropriate when the object doesn't outlive its last external |
| reference." What we want is to know when all "external" references to an object |
| are gone. At which point we can do something (cleanup, free, etc) and be sure |
| that no other code can get a reference. We want to do this without locks, if |
| possible, so we make use of a variety of helpful atomics (though RAMP uses a |
| lock-per-atomic). |
| |
| This document will ramble about the details and other stuff. |
| Here are things to keep in mind, which differ in some of the use cases: |
| - we want to do something when the refcount hits 0. |
| - we don't always know when that is. |
| - we might not have a function to call to explicitly remove the object. |
| - we might want to bring an object "back to life." This is outliving your last |
| external reference, but it is something we want. And so does Linux. |
| |
| kref rules (adaptation of the linux ones): |
| 1. If you pass a pointer somewhere or store it (a non-temporary copy), |
| kref_get() it first. You can do this with no locking if you have a valid |
| external reference. |
| 2. When you are done, kref_put() it. |
| 3. If you do not have a valid *external* reference, you can try |
| kref_get_not_zero(). You must have an internal reference for this. And to |
| protect that *internal* reference, you often need to lock or otherwise sync |
| around the source of that internal reference (usually a list-like structure |
| (LLS)), unless that ref is otherwise protected from concurrent freeing. |
| 4. If you plan to resurrect an object (make its refcnt go from 0 to 1), you |
| need to use some other form of sync between the final freeing and the |
| resurrection. |
| |
| We differ from Linux in a few of ways. We have a kref_get_not_zero(), which |
| they talk about in some of their documents, to be a bit more clear about getting |
| refs when objects might be at 0. Some of the differences is just semantics and |
| usage. We also allow incrementing by more than one, which helps in some cases. |
| We don't allow decrementing by more than one to catch bugs (for now). We also |
| put the release function in the kref, which is what Linux used to do. We do it |
| to cut down on the number of places the function pointer is writen, since I |
| expect those to change a lot as subsystems use krefs. Finally, we only use |
| krefs to trigger an object's release function, which might not free them |
| forever. They can be "resurrected" back to having an external reference. You |
| can do similar things in Linux, but it's not clear (to me) how separate that |
| idea is from a kref. |
| |
| External (strong) vs internal (weak) references: |
| ---------------------------- |
| I want something to happen (like for the obj to die) when its refcnt hits 0. I |
| don't know when this will happen, or who will be the one who makes the final |
| decref. So we have the kref - it automatically executes a callback when the |
| refcnt hits 0. However, there are cases when we have "references" (possibly |
| uncounted pointers) to an object that we do not want to consider in this |
| calculation. A good example is the superblock's list of all files. If that |
| reference is counted, the file will never be freed - it's just a list for |
| managing the existing files. There's a chicken-egg issue, since removing the |
| pointer to the file from the list is what should happen in the cleanup after the |
| refcount hits 0. This is the nature of an internal (weak) reference. External |
| (strong) references are those that count towards when an object should be |
| freed/called-back/whatever. |
| |
| So now we have lists and other sources of a "reference" / pointer (and it |
| doesn't matter if is counted in some variable other than refcount). We need to |
| be able to get valid external references for this. Just about any time that you |
| want to use the object, you'll need to do this - or otherwise prevent it from |
| dying/being clean-up. We can't simply kref_get(), since we don't already have a |
| valid external reference. Instead we use kref_get_not_zero(), which uses |
| atomic_inc_not_zero() (or something similar). Using this guarantees that if it |
| succeeded, there was *an* external reference - we just didn't have it. It is |
| possible that this fails, and it's a bit slower, so only use it when going from |
| an internal reference to an external one. For lists like the SB's file |
| list/tailq, this won't be too hard, since once we hit 0, we remove it from the |
| list and free it. Things are more complicated if we want to do other things |
| when we hit 0 - specifically if we want to "come back from the dead" and go from |
| 0 back to 1 or more. |
| |
| However, things aren't quite this easy. Short version: lock your list. Long |
| version: kref_get_not_zero() only works when we have a valid internal reference. |
| It works based on the assumption that its pointer is still the object (and not |
| some freed memory). You need to do something like: "lock the LLS, get your |
| pointer from the LLS, get_not_zero, unlock." The lock protects your internal |
| reference. Later on we talk about some other schemes, but ultimately you need |
| to sync or otherwise guarantee your internal reference is valid. |
| |
| This is probably why we want to either lock outside our list-solution, or |
| integrate the list/hash structure into the owning subsystem. (mentioned in more |
| detail below). Also note, with locking you need to be careful of deadlocking. |
| kref_put() will return 1 if the refcnt hit 0, so you can unlock and do some |
| other cleanup without holding a list's lock. This effectively allows you to |
| split the release between what kref_put() does automatically with whatever else |
| you want done. |
| |
| What about an 'internal' kref? If we refcnt the refs on lists, we still have |
| the same issue of syncing between a list reader and a list writer. The reader |
| needs to atomically read and kref_get (not_zero or otherwise). Otherwise, |
| someone can remove, put, free, and do whatever to the item between the read and |
| the kref_get (in theory). |
| |
| The reason for this is the same reason we have trouble with lists and internal |
| references in the first place: both the list reader and the writer are sharing |
| the same reference. One is trying to kref_get() it, while another is trying to |
| kref_put() it. If the freeing happens in between getting the pointer from the |
| list and the kref_getting, we'll have a bug. (unless we do some tricks related |
| to seq counters and grace periods, discussed below). This is the same issue, |
| regardless of whether the internal reference is using another kref or if it uses |
| some other scheme (like lock the object and check its state). |
| |
| Resurrecting an object after it hit 0 |
| ---------------------------- |
| So when we hit 0, we might not be completely done with the object. This is part |
| of the "kcref" (c == cached) design pattern the Linux guys talk about. The main |
| example they use (and we will too) is the dentry - you want to cache them, even |
| if their refcnt is 0. Once their refcount is 0, you still want to do things - |
| just not free them. Specifically, we'll set their state to show they are unused |
| (and thus freeable when there is pressure). The tricky thing is that when a |
| cache lookup finds the dentry, whose refcount is still 0, we need to be able to |
| get it and set its external refcount to 1 (bring it "back to life"). We can use |
| kref_init() or some other function for this - the important thing is that it |
| cannot happen while the object is being freed for real (when it is removed from |
| the cache). One way or another, we need to synchronize. |
| |
| The way Linux handles this is that there is a lock for the dentry object and one |
| overall one for the dentry cache. You can't convert an internal reference to an |
| external reference without hold this lock. Just locking on the object could |
| result in issues where an object is freed and reused as a *different* object, |
| which then the cache reader gets instead of failing at trying to get the |
| original object. The kref refcount only helps when the refcount goes from 1 to |
| 0, and on triggering the followup/release action. |
| |
| To resurrect, we can't just do: |
| if (!kref_refcnt(&dentry->d_kref)) |
| kref_init(&dentry->d_kref, dentry_release, 1); |
| else |
| kref_get(&dentry->d_kref, 1); |
| |
| There is a race on reading the refcnt and mucking with it. If it is |
| concurrently going from 1 -> 0, and we read 0, it is okay. We still up it to 1. |
| However, if we go from 1 -> 0 and read 1, we'll panic when we try to kref_get a |
| 0'd kref. Also, doing this would be dangerous going from 0 -> 1 if other code |
| would resurrect (which it does not!). The solution is to use a kref_get that |
| doesn't care about 0 (__kref_get()). |
| |
| Elsewhere in this documentation, we talk about kref_get_not_zero(). That one |
| will try and fail gracefully (used by pid2proc()). kref_get() will fail on |
| zero. __kref_get() will not fail on zero and will blindly increment, which is |
| what we want. |
| |
| Trickiness with lockless data structures: |
| ---------------------------- |
| Ideally, we want concurrent access to the dentry cache (or whatever cache has |
| krefd objects that we want to resurrect). Perhaps this is with CAS on linked |
| lists, or locks per hash bucket. Since we need to prevent the resurrection of |
| objects from proceeding while another thread could be trying to remove them, we |
| need some sync between readers and writers. Both threads in the scenario we've |
| been talking about are going to be writers for the object, though from the |
| perspective of the list-like-structure, only one is a writer - the one who makes |
| it such that we cannot create an external ref from an internal one, because the |
| object is dying. |
| |
| Simply doing this: "lookup, object lock, muck with state" will result in bugs, |
| since the state change changes the lookup, but the locking happens after the |
| lookup. We could be manipulating a newly slab-allocated object (or anything, |
| really), that has an unlocked byte where we expect it (and thus our spin_lock() |
| succeeded). |
| |
| At this point, there are two options: Handle it in the list-like-structure, or |
| play more clever games: |
| |
| 1: Push it into the list-like structure (LLS) |
| ---------------- |
| In the first case, the state changing needs to be pushed into the data structure |
| managing the objects for these types of scenarios. For example, a |
| spinlock-per-hashbucket scenario would need to know to execute a certain |
| function on the object if it finds it before unlocking the bucket. Sync systems |
| based on RCU will need to be careful with the final delete function. I think |
| this is part of the reason why Linux doesn't have a built-in solution for a hash |
| table - subsystems roll their own using tools (instead of solutions), since the |
| removal is coupled with how the subsystem works. |
| |
| 2: Seq-ctrs and Reused objects |
| ---------------- |
| The other way, and one we may look into, is to push a seq-ctr into the object. |
| The plan is then: "{lookup, object lock} (while seq changed), muck with state" |
| (being careful to unlock, etc). The idea is that the seq-ctr gets incremented |
| when the object is freed / given back to the slab allocator (which is done by |
| the removal thread). If the seq-ctr changed, that means we are no longer |
| looking at the same object, and we need to retry the lookup (start over). The |
| seq-ctr exists for the memory region dedicated to the object, not for a specific |
| object. |
| |
| This approach needs a little more help: it assumes that any freed object will |
| continue to be an object of that type for a little while. We are dealing with |
| slab-allocated objects, so that is a bit more true than when dealing with |
| kmalloc(). However, it is possible that there was memory pressure and the slab |
| subsystem freed those pages. We need to be sure that this event has not |
| happened before we try to lock the object. It's okay to lock and unlock an |
| 'unrelated' dentry. It's not okay to try and lock a random spot in memory. We |
| could have a variety of read-mostly globals to perhaps signal slab-pressure, |
| dentry-slab-pressure, or something like that, depending on who else needs to |
| know about that event. These seem tricky, and probably require some form of |
| locking, which leads to the most likely candidate: RCU-style grace periods. |
| |
| The thing we really care about is that the pages are not allocated to some other |
| use. For unrelated (TLB) reasons, we may build an RCU-like mechanism for |
| dealing with free pages, and we could consider using this - though if the slab |
| allocator is asked for memory, it is because we are running out, and we don't |
| want to wait for grace periods. Anyway, we'll probably never do any of this, at |
| least until it becomse a problem. |
| |
| FYI: the whole lock on possibly the next incarnation of the object will only |
| work if we unlock it at some point - might be a while, if it doesn't get |
| unlocked until it's spinlock_init()ed as the new object. If we do this, we |
| ought to unlock before slab-freeing. |
| |
| Final Notes |
| ---------------------------- |
| There is a difference between removing references to you from all over the place |
| and cleaning up references that you know about (like when you know you are in a |
| TAILQ all the time vs. being in a process's file list). |
| |
| Process management is a bit different, since it does not want to destroy or free |
| you until there was some explicit action (calling proc_destroy()). We still use |
| krefs, since we don't know who is the "last one out" to do the freeing, so we |
| layer proc_destroy() on top of kref/__proc_free(). This is why we have the "one |
| ref to keep the object alive." For a little while, this ref was stored in the |
| pid2proc hash, but that is now holds an internal reference (we have the tech, it |
| keeps things in sync with other usage models, and it makes proc_destroy and |
| sys_trywait easier). Note the runnable_list has external references, in part |
| because it is a different subsystem (scheduler). |
| |
| Remember: the reason for why we have trouble with lists and (internal) |
| references: both the list reader and the writer are sharing the same |
| reference/pointer. We need to either make sure that a reader can use the |
| pointer (pinned, kref_get(), a flag, whatever while preventing access to the |
| pointer from another thread (lock the list)), or that it can *safely* find out |
| it needs to try again. "Safe" means not accessing freed memory - hence the |
| grace periods on slab-freed pages. |
| |
| Kreffing works because we have a known, synchronous initialization point where |
| we kref_init() the refcount to 1. We can't do that easily when resurrecting or |
| even kref_get_not_zero() because another thread may be trying to permanently |
| free the object. |
| |
| The difference between kref_get_not_zero() and resurrection is that |
| get_not_zero() is trying to get an external reference and the object's release |
| method has not been called (or it has, and we should fail since our source is a |
| weak/internal reference source). Resurrection is when we've already hit 0, |
| release()d, and now want to reuse that object. For concrete examples: |
| get_not_zero() works when getting a file off the superblock file list - it only |
| should work if the file is still in the system and not about to be removed from |
| the list. Resurrection is for objects that don't get freed when they lose |
| their external references, such as dentries (they get put on the dentry cache). |
| On a cache hit for an unreferenced dentry, we'll need to change its state and |
| reinit its refcount. Next time it is kref_put()d to 0, it'll rerun its |
| release() method. |
| |
| Deadlocking is still an issue, and might be what forces us to more lock-free |
| lists. Here's an example of what not to do: grab the SB's tailq lock, for each |
| file, kref_put it, then unlock. That code might be trying to force all files to |
| close, which is just fundamentally flawed anyways (since you're decreffing an |
| internal reference, instead of an external one). But you'll deadlock too, since |
| the file's kref release() method will try and grab the same lock to rip the item |
| off the list. So this is a contrived example, but it goes to show that you need |
| to be careful about where you are locking and what functions your kref_put() |
| might trigger. |