Unorganized Notes on the Virtual File System
-------------------------------------
Our VFS is built similarly to Linux's, mostly because I'd like to have somewhat
easy integration with ext2 (or at the very least have our own FS that can
integrate into Linux easily).

There are four main objects in a filesystem: superblock, inode, dentry, and a
file.
- Superblock: Specific instance of a mounted filesystem.  All
synchronization is done with the one spinlock.
- Inode: represents a specific file.
- Dentry: in-memory object, corresponding to an element of a path.  E.g. /,
usr, bin, and vim are all dentries.  All have inodes.  Vim happens to be a
file instead of a directory.
- File: represents a file opened by a process.
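
As a rough sketch, the four objects might relate like this in C.  Field names
here are illustrative stand-ins, not the actual ROS headers:

    #include <sys/types.h>              /* off_t */

    typedef int spinlock_t;             /* stand-in for the real lock type */

    struct dentry;                      /* forward declaration */

    struct super_block {                /* one per mounted FS instance */
        spinlock_t s_lock;              /* the one lock for all sync */
        struct dentry *s_root;          /* dentry of this FS's "/" */
    };

    struct inode {                      /* one per FS file */
        struct super_block *i_sb;
        unsigned long i_ino;
    };

    struct dentry {                     /* one per in-memory path element */
        char *d_name;
        struct dentry *d_parent;        /* "/" points to itself */
        struct inode *d_inode;          /* the file this name refers to */
    };

    struct file {                       /* one per open of a file */
        struct dentry *f_dentry;        /* the name it was opened through */
        off_t f_pos;                    /* private file position */
    };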

So far, dentries are the most complicated, so here are some notes about them.
These are probably self-consistent, but may be a bit old (or just wrong) because
I wrote them as I put the VFS together:

A dentry is just a connection between a part of a path and an inode.  We just
want to walk from dentry to dentry, asking each to find the next part in the
path.  When you hit a dentry that's not the end of the desired path, you ask its
inode for the next dentry by using its operation (lookup).

lookup() takes the inode of the directory (or 0 for the root) and a dentry
(already allocated, with the d_name and i_name filled in), and it will find the
correct inode for the given dentry, as well as filling out any specific FS
things in the dentry (like d_ops).  It will return 0 on failure, or the dentry
pointer passed in on success.  Somehow the nameidata bit will be used too.  This
will probably change a bit...  Note that lookup() needs to read the actual
directory file and also look up the inode off the disk, which means it will
block.
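
In C, the shape of that operation is roughly the following (the signature is
illustrative, and the nameidata handling is omitted):

    /* Illustrative sketch: find the inode for 'dentry' within directory
     * 'dir' (0 for the root).  Fills in dentry->d_inode and FS-specific
     * fields (like d_ops).  Returns the passed-in dentry on success, 0 on
     * failure.  May block: it reads the directory's contents and the
     * target inode from the backing store. */
    struct dentry *lookup(struct inode *dir, struct dentry *dentry,
                          struct nameidata *nd);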

When the next dentry is a mountpoint (e.g. cdrom in /mnt/cdrom): when you ask
mnt for cdrom, lookup() will parse its file (a directory) and see 'cdrom' there
as a child entry.  Then it will return cdrom's dentry, like always.

But since cdrom was a mountpoint (which you can see from the d_mount_point),
you need to walk through the structures a bit to get the dentry of the FS rooted
at that mountpoint to keep walking.  The VFS can handle that, so lookup()
doesn't need to worry about it.

Why are there two dentries associated with a vfsmount?  Normally, a dentry for a
directory points to an inode that contains its members.  When a FS is mounted
over it, we don't want to destroy that.  Instead, we mark it as a mountpoint,
and when following, we walk into the next FS.  The mountpoint is the dentry in
the parent mount (using the parent mount's FS superblock), and the root is the
dentry of the mounted FS, using the FS superblock of the mounted FS.
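
Concretely, the two dentries could hang off the vfsmount like this (a sketch;
the actual field names may differ):

    struct vfsmount {
        struct dentry *mnt_mountpoint;  /* dentry in the parent FS, with
                                         * d_mount_point set, left intact */
        struct dentry *mnt_root;        /* "/" dentry of the mounted FS */
        struct super_block *mnt_sb;     /* SB of the mounted FS */
        struct vfsmount *mnt_parent;    /* mount we are mounted on top of */
    };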

Sometimes lookup() will not be called at all.  We want to look for present
dentries first (ones that exist in memory already, including all of those with a
refcnt).  So we can look in the cache (currently just an SLIST, but will
eventually be a hash table).  When hitting in the cache (hashed by dentry name,
and probably the parent dir), we need to verify that the dentry's parent is our
own (handled by the VFS code).  All vfsmount dentries will be in the cache,
since they all have a refcnt (the vfsmount struct points to them).

Dentries for / and pwd and whatnot have a refcnt (pointed to by fs_struct,
vfsmounts, etc).  Anything with a pointer to it has a refcnt, so it never goes
away.  So if we have an inode in memory, its entire dentry path is cached
(just follow the parent pointers back).  Note that when a dentry's refcnt hits
0, we do *not* deallocate (like we do with many other structures).  It will just
be on the LRU list in the dcache system.  Likewise, every dentry points to its
inode, which pins that inode in memory.

Other refcnts: just about everything has a refcnt, and we need to be careful
about when we want to use them and dealloc.  Some things would be a pain in the
ass, like with the super_block refcnt.  Every dentry has a ref to the SB, but
changing that ref every time we add or remove a dentry will probably be an
unnecessary penalty if we can ensure all dentries of that FS are gone before
removing the superblock through another mechanism.  We'll see.  Mostly, this
just means we need to think about what we really want from a refcnt, and whether
or not we want the kref / process style refcnting.

Mounting:
-------------------------------------------
When you mount, you need to read in the super block and connect the relevant
data structures together.  The SB is connected to the vfsmount, which is
connected to the dentry of the mount point and the dentry of the root of the FS.
This means when mounting the FS, we need to create the dentry for "/", which
means we also need the inode, which needs to be read_inode()'d in.  Actually, we
might not need to read the inode in right away - we might be able to get away
with reading them in on-demand.

All of this means that for every mount point, the SB, vfsmount, dentry, and
inode are in memory.  Due to the way dentries link, every dentry and inode back
up to the real root are all in memory too.  Having a mount point is like having
a process working in that directory - the chain back up is pinned.
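
A sketch of the wiring done at mount time follows; every helper name here is a
stand-in for whatever the FS-specific routine ends up being:

    /* Sketch: connect the structures for a new mount.  Error handling
     * omitted; get_sb() and read_root_inode() are hypothetical. */
    struct vfsmount *do_mount(struct dentry *mountpoint, struct fs_type *fs)
    {
        struct vfsmount *mnt = alloc_vfsmount();
        struct super_block *sb = get_sb(fs);        /* read in the SB */

        mnt->mnt_sb = sb;
        mnt->mnt_mountpoint = mountpoint;           /* pins the chain up */
        mountpoint->d_mount_point = TRUE;
        mnt->mnt_root = alloc_dentry(sb, "/");
        /* could defer this and read inodes in on-demand instead: */
        mnt->mnt_root->d_inode = read_root_inode(sb);
        sb->s_root = mnt->mnt_root;
        return mnt;
    }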

d_subdirs:
-------------------------------------------
Tracking the links between objects can be tricky.  One pain is d_subdirs.
Linux only tracks subdirectories.  We also do this.  I think the reason they do
it is since you should be using the dcache to cache lookups, and not scan the
linked list of children of a dentry for a specific file.  Though it is handy to
know all of your *directory* children.  In KFS, we also track all children in a
list.  This is to make our lookups work - instead of having an actual directory
file with name->ino mappings.

KFS and metadata pinning:
-------------------------------------------
KFS pins all metadata ("all paths pinned").  This means that from the root mnt
down to the lowest inode, all dentries and corresponding inodes are pinned in
memory from creation time.  Yeah, this means we have chunks of metadata for
files we aren't using sitting around in RAM.  We also have the *files* sitting
around in RAM too.  Not that concerned, for now.  Plus, I don't want to reparse
the CPIO backing store to figure out inode fields, directory names, etc.

Page Cache:
-------------------------------------------
Every object that has pages, like an inode or the swap (or even direct block
devices) has a page_map tracking which of its pages are currently in memory.
This is called a struct address_space in linux, which is a confusing name.  We
don't have all of the same fields yet, and may be doing things slightly
differently, but the basics are the same.  Together, the page_maps and the
functions to manipulate them make up the Page Cache.  Every page frame that is
in a page mapping can be traced back to its page_map, via pointers in the struct
page.  Note the page_mapping is tracked twice for a file, the f_mapping and the
i_mapping.  We really only need the i_mapping, but this saves a couple cache
misses.  Might go away later.
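
For reference, a minimal page_map could look like this (fields illustrative):

    struct page_map {
        struct radix_tree pm_tree;      /* page index -> struct page * */
        struct inode *pm_host;          /* object these pages belong to */
        unsigned long pm_num_pages;     /* pages currently present */
        spinlock_t pm_tree_lock;
    };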

As a side note, Linux's "address_space" has a great example of the power of
their linked lists.  Their struct has a private_list head.  Due to their list
implementation, they are able to have a generic list head for a list of any
type (it's a struct list_head), and don't need to declare in a generic object
(like the page_map) a specific list type (that would get used by specific
FS's).
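
Here is the trick in miniature - a userspace rendition of the pattern, not
Linux's actual code:

    #include <stddef.h>
    #include <stdio.h>

    struct list_head { struct list_head *next, *prev; };

    #define container_of(ptr, type, member) \
        ((type *)((char *)(ptr) - offsetof(type, member)))

    /* The generic object only declares a list_head... */
    struct generic_obj {
        struct list_head private_list;
    };

    /* ...and an FS-specific type chains itself through it. */
    struct fs_private {
        int data;
        struct list_head link;
    };

    int main(void)
    {
        struct generic_obj obj = { { &obj.private_list, &obj.private_list } };
        struct fs_private item = { 42, { NULL, NULL } };

        /* list_add(&item.link, &obj.private_list), done by hand: */
        item.link.next = obj.private_list.next;
        item.link.prev = &obj.private_list;
        obj.private_list.next->prev = &item.link;
        obj.private_list.next = &item.link;

        struct fs_private *p = container_of(obj.private_list.next,
                                            struct fs_private, link);
        printf("%d\n", p->data);        /* prints 42 */
        return 0;
    }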

Just because a page is in a page_map, it doesn't mean it actually has the
data from the disk in it.  It just means that there is a physical frame
dedicated/mapped to be a page_map's holder for a specific page of an object
(usually a file on disk).  readpage() is called to fill that page in with what
it is supposed to have from its backing store.

This interpretation makes the meaning of "address space" make more sense.  It's
the "address space" of the device, mapping from (file,index) -> page_frame.
Still, calling it an address space just confuses things with the virtual memory
address space.

One of the reasons pages can be in the map without up-to-date data is due to
concurrency issues and outstanding I/O requests.  When the page is first being
filled up, the mapping exists but the data hasn't been brought in yet.  Other
processes can be trying to access the same block/page/byte; they need to block,
but they should not schedule a duplicate IO operation.

So here's how a typical page fault (__handle_page_fault(), either on demand or
populated) works on an mmap'd file at a high level.
1. PF is on a virt address -> translated by the vm_region to a file/offset (page).
1b. Weird permission issues?  See below!
2. Look it up in the page_map.
3. If the page is already there, and up-to-date, then great.  Skip to 6.  If
there is one, but it's not up to date, skip to 5.
4. If there is no page, get a free page, tie it (bidirectionally) to the inode
in the page_map.
5. Now there is a page, but it is not up to date, so call readpage().  This will
usually block.
6. Map that page (which has the current contents) into the address space of the
calling process with the appropriate permissions: RO (MAP_PRIVATE, CoW) or
RW (MAP_SHARED).
Below: Now if in step 1 you had a permissions issue, such as a RW fault on a CoW
MAP_PRIVATE, you'll have to notice the type of error and the type of memory
region, then go through a separate path: get a new page, copy the contents, and
change the mapping.  Note, the page backing that mapping is not backed by the
file - it's just sitting around in the virtual memory of the process.
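
In code form, the non-CoW flow is roughly this (a sketch; the helper names are
stand-ins, and error handling is omitted):

    /* Sketch of the fault path for an mmap'd file. */
    int fault_in_file_page(struct vm_region *vmr, unsigned long index)
    {
        struct page_map *pm = vmr->vm_file->f_mapping;
        struct page *page = pm_find_page(pm, index);        /* step 2 */

        if (!page) {                                        /* step 4 */
            page = alloc_page();
            pm_insert_page(pm, index, page);  /* tie page <-> inode */
        }
        if (!page_is_uptodate(page))                        /* step 5 */
            pm->pm_op->readpage(pm, page);    /* usually blocks */
        /* step 6: map the frame with the region's permissions:
         * RO for MAP_PRIVATE (CoW), RW for MAP_SHARED */
        return map_page_at_addr(vmr, page, index);
    }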

Also, if we want to use the PG_DIRTY flag, we'll need to mark the regions as RO
until we write fault, at which point we dirty the page and change it to RW.

We could have multiple threads trying to fill a page in the page cache at once.
This is handled in file_load_page().  All threads check the page cache.  If two
threads try to add it to the page cache, only one will succeed, and the page
will be locked (PG_LOCKED).  The one who succeeds will readpage().  The one that
didn't will be like any other thread that is checking the page cache - it will
see a page is there, and will check whether the page is up to date.  If it
isn't, it will try to lock the page so it can do the IO, with a few extra
checks in case the page had been removed or was filled in while it slept.

A big aspect of this is the use of lock_page() to block.  If the page is locked,
you block until it is unlocked.  (implementation and other issues still
pending).  Even the caller of readpage will lock after submitting the IO
request.  This will cause the caller to sleep until the IO is done.  When the IO
is done, that interrupt handler/thread will mark the page as up-to-date, and
unlock the page, which will wake up any of the waiters.  The caller of
readpage() may not be the first to wake up either.  This all means the IO system
needs to know about marking pages as up-to-date and unlocking them.  This
currently (Jul10) is just sitting in KFS, but can be done later, either directly
or with a callback made by whoever starts the IO.
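
In outline, the winner's side of this looks something like the following (a
sketch of the idea, not the actual file_load_page() code):

    /* Sketch: the thread that won the page-cache insertion holds the
     * page lock (PG_LOCKED) and does the read. */
    void fill_new_page(struct page_map *pm, struct page *page)
    {
        pm->pm_op->readpage(pm, page);  /* submit the IO and return */
        lock_page(page);    /* sleep until the IO handler marks the page
                             * up-to-date and unlocks it */
        unlock_page(page);  /* other waiters in lock_page() wake, see the
                             * page is up to date, and proceed */
    }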

A note on refcnting.  When a page is added to the page cache, that's a stored
reference.  When you lookup a page in the page cache, you get a refcnt'd
reference back.  When you pull a page from the page cache, you also get a
refcnt'd reference back - specifically it is the ref that was in the page map.

Files with Holes
--------------------------
If a file has a hole, we'll still try to look up the page in the page cache.
When that lookup misses, we'll create and add a page, then call readpage().
Readpage will realize there is no page on disk/backing store for that part of
the file (since it was a hole) and just memset the page to 0.  In the future, we
can consider getting a CoW 0-page, but that's a bit premature and a bookkeeping
pain.
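
Inside an FS-specific readpage(), the hole case might look like this (a
sketch; lookup_block() and submit_read() are hypothetical helpers):

    int fs_readpage(struct page_map *pm, struct page *page)
    {
        unsigned long blk = lookup_block(pm->pm_host, page->pg_index);

        if (!blk) {
            /* hole: no backing block, just hand back zeros */
            memset(page2kva(page), 0, PGSIZE);
            set_page_uptodate(page);
            unlock_page(page);
            return 0;
        }
        return submit_read(blk, page);  /* normal, backed read */
    }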

This also applies to trying to write to a block beyond the EOF.  If the request
hits the page cache and readpage(), it's because it was already checked and
cleared in another part of the VFS, such as in generic_file_write().

Files That End Before a Page Boundary
--------------------------
So what if we have a file that is only 1024 bytes?  When we read it in, we'll
have a whole page added to the page cache, with the extra 3K 0'd.  When we go to
write it back, it will write back the 1K, and ignore the rest.  But what if we
extend the file later?  That page is already in the page cache, but the buffer
heads aren't tracking the new 3K.  When that page gets evicted, only the 1K will
write back.

There are two ways to deal with this: 1) When we extend the file, check and see
if it is in the middle of a page, and if so, alloc the blocks (all FS
specific!), and adjust the BHs to map the new data.  2) Preallocate the
remaining blocks and stitch up the BH mapping at readpage (which is in the
specific FS).  This way, we reserve the blocks (but don't have to actually read
them in - we just mark the buffer as dirty).  When we grow a file, we don't need
to worry about any of these details.  We just make sure the page map has the
page, without worrying whether it's a half-page that needs an FS-specific
solution.  I'm going with #2 for now.  Note that while the blocks are allocated,
the file's size is still 1024B.

Kref, Dentries, Inodes, and Files (or "I don't see why it's like X, but...")
--------------------------
There are multiple dentries pointing to an inode.  The dentries are (or will be)
cached too, but that is irrelevant.  The dentries pin the inodes in memory.
However, files pin inodes in memory (or they did) briefly.  After running around
in circles a bit, I asked, why doesn't the file pin the dentry instead of the
inode?  The answer: it is supposed to.  Linux does that, and I didn't because
pinning the inode made more sense at the time.

The heart of the issue is about understanding what files are and how they
relate to the rest of the VFS.  A 'file' in the OS is a structure to track an
open FS-disk-file, which is managed by the inode.  Given this, it makes sense
that a dentry (which is a name on a path) would be pinned by the corresponding
inode, and the file would pin the inode.  It doesn't, but it is believable.  In
reality, there are many names (dentries) for a given disk file, and the OS file
that you open corresponds to *one* of those names, and thus a dentry, and not to
the inode/specific file.  You need to go through the dentry to pin the inode.

In short, it's not:  file -> inode -> dentry -> parent_dentry -> ...
It's:                file -> dentry -> parent_dentry -> ...
                             |-> inode |-> parent_inode
Another dentry and file (both OS and disk) can point to the same inode.  If you
don't do it this way, you can't pin up past the multi-dentry-point in the inode,
and your system doesn't really make sense.

So here is how it works: files pin dentries.  Dentries can pin other dentries,
on up the directory hierarchy.  Independently of the files, dentries pin their
inode.  There are many dentries per inode (or can be).  Since each dentry
doesn't know if it is the last dentry to decref the inode, we use a kref on
i_kref.  The inodes store references to the dentries, but those are the kref
"internal" / weak references: even when we decref them, nothing is triggered
by it.

The moral of the story is that if you don't fully understand something, you are
not in as good of a position to recommend changes or criticize as if you did
your homework.  Not that you can't, just that you should 'do your homework.'

Musings on path_lookup()
--------------------------
Things can get tricky with path lookup, especially with ., .., and symlinks.
When doing a LOOKUP_PARENT on a . or .., we give the parent of whatever the path
would have resolved to.  So /dir1/dir2/dir3/.'s parent is dir2.
/dir1/dir2/dir3/..'s parent is dir1.  I don't think Linux does this (note the
parent lookup is for internal kernel stuff, like when you want to edit
metadata).  When you try to make a . or .. file, you should get some sort of
error anyways.  We'll see how this works out.

Symlinks can be a bit tricky.  We handle ours a bit differently too, especially
regarding PARENT lookups.  Ultimately, you can do the same things in ROS that
you can do in Linux - if you try to create a file that is a dangling symlink,
you'll correctly create the destination file.  We handle this in
link_path_walk().  It will return the PARENT of whatever you would resolve to -
instead of trying to handle this in do_file_open() (which I think linux does).

Also, our handling of symlinks differs a bit from linux.  Eventually, it has
become clear we're going to need to manually port ext2, and we do some things
differently in our core VFS already.  Might as well do more things differently -
like getting rid of follow_link and put_link from the FS specific sections.  Our
FSs just need to know how to return a char* for a symname - and not do any of
the actual link following.  Or any of the other stuff they do.  We'll see if
that turns out to be an issue or not...

Unlinking and other Link Stuff
-------------------------
Unlinking is just disconnecting a dentry-inode pair from the directory tree, and
decreasing the inode's i_nlink.  Nothing else happens yet, since we need to keep
the FS-file (controlled by the dentry/inode) so long as any OS-files have it
open.  They have it opened via open or mmap - any way that there is a reference
to a file, which then pins the dentry and inode.  When the OS-files close,
eventually the dentry's refcnt hits 0.  When it does, it normally would be up
for caching, but we can check nlinks and just drop it.  When that happens, it
releases the inode, which will see its nlinks is 0.  That will trigger the
underlying FS to clear out the FS-file.
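
A sketch of that flow (helper names are stand-ins):

    /* Sketch of the VFS side of unlink.  Error handling omitted. */
    int do_unlink(struct inode *dir, struct dentry *dentry)
    {
        dir->i_op->unlink(dir, dentry);      /* remove the dir entry */
        dentry->d_inode->i_nlink--;          /* drop the on-disk link */
        dcache_remove(dentry->d_sb, dentry); /* no future lookups */
        kref_put(&dentry->d_kref);  /* final release deferred while any
                                     * OS-file still holds the dentry */
        return 0;
    }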

For directories, you can only have one hardlink to a directory - meaning you are
only in the directory tree in one place.  However, all of your children can get
to you by going '../'.  We'll count these as hardlinks too.  This means that
every child increments its parent-dir's nlink.  These are the on-disk links, not
to be confused with the dentry->d_parent and kref() business that goes on for
the in-memory objects.  A directory cannot be removed if nlinks > 1.  If it is
1, then you can rmdir it, which will set its nlinks to 0.  Then its inode's
storage space will get freed when it is deleted, like any other inode.  In
theory.

Inodes: drop? delete? dealloc?
--------------------------
Inodes exist both in memory and on disk, but in different manners.  When a file
(F_WHATEVER, could be DIR) exists in an FS, it'll have an inode on disk.  When
it is time to delete that file, we call _delete_inode().  When we want to free
the memory associated with an in-memory (VFS) inode, we call _dealloc_inode().

What about drop_inode?  For now, we don't use it.  We have inode_release() in
the VFS.  If we need an FS specific one (like for ext2), or have FS-specific
work that needs to be done in inode_release(), we'll use it later.

Either way, inode_release() is called when we no longer use the in-memory inode.
If there are no hard links, it will delete the inode.  Either way, it will just
free the in-memory inode (after deleting the disk version).
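
A sketch of that release path (the s_op hooks here stand in for the
_delete_inode() / _dealloc_inode() calls above):

    /* Sketch: runs when the last in-memory user drops the inode. */
    void inode_release(struct kref *kref)
    {
        struct inode *inode = container_of(kref, struct inode, i_kref);

        if (inode->i_nlink == 0)
            inode->i_sb->s_op->delete_inode(inode);  /* free disk storage */
        /* writeback of a dirty, still-linked inode would go here */
        inode->i_sb->s_op->dealloc_inode(inode);     /* free the memory */
    }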

Example: I want to unlink (rm) a file.  There are two cases: the file is already
open (with a dentry and the inode in memory) or the file is not.  Both cases are
handled the same way!  In either case, we eventually call do_lookup on the item
in question, getting both a dentry and its inode read in.  (We read it in for a
couple reasons: convenient to check the type, and we need to manipulate the
nlink).  If a process has the file open, or even if it is sitting in the cache,
we will get the same inode (from the inode cache, might not be implemented yet).
When we decref the dentry and it is done, it will decref the inode.  This
dentry's final decref will be deferred until any open files are closed.  Note,
this requires a working dentry/inode-cache - otherwise we'll have multiple
copies of the same FS/disk-inode (and possibly dentry).  Anyway, when this is
done, the release function will delete the inode, then dealloc it.

Another example:  We simply close a file.  When that happens, we decref the
dentry, which decrefs the inode.  It may remain cached for a bit - not a big
deal.  When it is finally removed, nlinks is positive, so the inode's in-memory
copy is written back (if it was dirty) and the structure is deallocated.

Side notes: dentry cached inodes should be removed after their lookup in unlink.
Also, since multiple dentries point to the same inode, it's not enough to just
cache dentries - we need to be able to find inodes too so that we get the one
inode regardless of which dentry we use (which may be uncached).

Dentry and Inode Caches
--------------------------
The dentry cache caches dentry lookups - we need the parent and a hash of the
name to do a lookup.  The dcache consists of a hash table for the lookups, as
well as an extra list of entries that are unused (their kref is 0).  The dcache
also caches negative entries - entries for lookups that failed (the name does
not exist).  This still speeds up future requests.  Most uses of the system
just need to use dcache_get and dcache_put.  Not all of this is implemented yet.
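
Typical usage on a lookup path might look like this (signatures illustrative):

    struct dentry *found = dcache_get(sb, query);  /* check the cache */
    if (!found) {
        found = dir->i_op->lookup(dir, query, 0);  /* miss: ask the FS */
        if (found)
            dcache_put(sb, found);          /* cache it for next time */
    }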

The inode cache is similar, though we can't just have the inodes hang off the
dentry cache (more than one dentry points to the same inode).  We don't need to
worry about unused lists or anything like that - once the kref hits 0, we're
done and we can rip it out of the hash.

Both hashes hang off the superblock, with concurrent access protected by locks
in the SB.

The dentry cache is the weirdest of them all - for normal entries, its key and
value are the same thing.  The actual hashing of a dentry is done by the qstr
value, and to determine equality, we need to compare parents (compared to the
inode cache, where the only thing that matters is the i_ino).  Put another way,
we need elements of the whole dentry to get a unique key (d_parent and d_name).
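
A sketch of that equality check (assuming a qstr with name/len fields):

    #include <stdbool.h>
    #include <string.h>

    /* Two dentries hash/compare as the same key only if both the name
     * and the parent match. */
    static bool dentry_keys_equal(struct dentry *a, struct dentry *b)
    {
        return a->d_parent == b->d_parent &&
               a->d_name.len == b->d_name.len &&
               !memcmp(a->d_name.name, b->d_name.name, a->d_name.len);
    }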

As stated above, the dcache also caches negative entries.  This is to prevent a
lookup on disk.  These negative entries are in the dcache and on the LRU list
(their refcnt is 0, they are not USED).  When we dcache_get, we don't bother
with returning the actual dentry (after increffing) and then decreffing it
again.  Instead, we just return the negative result (via the query dentry,
incidentally).

Freeing of dentries happens in one of two ways: call __dentry_free() directly,
which is appropriate when you have the only copy (like in do_lookup()), or it
will get freed when the dcache gets reaped (the LRU entries are freed).  When it
is decref'd, it simply goes into a state where it is ready to be reaped, but
kept around for future lookups - most usages throughout the vfs can just decref
when they are done using it.

One complication is cached negative dentries.  These are only referenced once
(in the dcache), so they can get __dentry_free()d directly.  This gets tricky
with rmdir and unlink.  Initially, those functions marked the dentry as negative
and unused, and would let them stay in the dcache (returning negative results on
future lookups).  The problem with this is that now the dcache could have a
negative dentry that was a real, formerly used dentry - one with a refcnt that
needs to be decref'd and released.

There are two solutions: one is to change the dcache to not assume that negative
entries are unreferenced (which also means on the LRU).  The other is to just
remove the dentry from the dcache on rmdir/unlink.  It won't be negative - and
that won't matter, since it is un-lookup-able.  And it will die nicely when it
gets decref'd.  All we'll do is add a DENTRY_DYING flag, and dentry_release()
will skip the LRU/unused handling.  The dcache can continue to assume that
negative entries are unused/LRU/dentry_freeable/ref==0, and not worry about
calling kref_put().

X: Buffer Cache, Page Cache, Buffer Heads, WTH?
-------------------------------
X.1: Page vs Buffer Cache
--------------------
So we discussed the page cache, but as described, it does not satisfy all of
our disk caching needs.  Traditionally, there would also be a 'buffer cache.'
Buffers usually refer to memory used to hold data from the disk (or network).
I can think of a couple different ways someone could think of the buffer
cache, and to understand them, we first need to understand what we need to be
caching.

There are two main types of FS data: file data and metadata.  This metadata
is FS-specific data accessed by the VFS to get to a file's contents.  For
example, the superblock, inodes, inode indirect blocks, and to a certain
extent the directory's contents are all metadata.  There isn't an FS file (not
to be confused with an OS file) that corresponds to this data, but it needs to
be read in and cached.  Furthermore, this cache needs to be managed and
written back when dirty.  Note that file data can be broken up into two
different types: read()/write() data and mmap'd data.  When people talk about
buffer caches versus page caches, they are talking about these two different
types of file data (at least NetBSD did
http://www.usenix.org/event/usenix2000/freenix/full_papers/silvers/silvers_html/).

A very simple buffer cache would include anything *ever* read from disk.  It
would then get copied into the page cache in PGSIZE chunks.  This would suck,
since we now have two or three copies.  We obviously want to use the page
cache for both mmap() and read/write().  It's not clear about the metadata.

Another usage of a buffer cache would be to cache only the disk blocks needed
for metadata.  I think this is what Linux did before it unified its buffer and
page caches (implied in UTLK).  The main issue with this is that you have two
different systems for managing essentially similar data - we only want to deal
with flushing, syncing, and writebacks in one subsystem, not in multiple
different subsystems.

The solution is to use the page cache to cache the metadata blocks' buffers.
We do this by having the block device be 'mapped' (but not read) in PGSIZE
chunks through its own struct page_mapping (a.k.a. struct address_space in
Linux).  This way, both files and the block device are mapped in PGSIZE chunks
via the same page_mapping structure, and will be managed by the same code.

Sort of.  We don't actually read in PGSIZE chunks for the block buffer cache.
If blocks will be in the bdev's cache, then they will be adjacent and on the
same page.  It is possible some adjacent blocks (which would be on the same
page) are not both metadata.

A more accurate way to describe what we do is that metadata blocks are copied
into a 'buffer cache' that is mapped and managed similarly to the page cache.
Pages are made of buffer heads, which hold data, and the reclaiming of pages of
memory from either the page cache or the buffer cache will use the same code -
since they are both just made of buffer pages.

X.2: Mapping Blockdev Data vs File Data
--------------------
An important distinction between file data (as managed by an inode) and the
bdev is that PGSIZE chunks of data for the bdev *must* be made of contiguous
disk blocks.  Ideally, file data is also contiguous/sequential, but that is
not always the case - hence the need for the inode's block pointers.  This
means that the chunk at the bottom of the page_mapping radix tree is a page,
and on that page there may be several buffers, holding the data of
nonsequential disk blocks - but not all pages are like this.  The bdev
pages are made up of sequential blocks due to the very nature of what we are
mapping; there's no inode abstraction in between.

There are a couple ways we could handle this.  We adopt the Linux approach of
using something called a buffer head (BH), which describes the mapping from
in-memory buffer to block device / block number.  These are slab-allocated,
and exist for each buffer of a page.  The page itself points to the first of
its BHs, all of which exist in a LL.  The maximum number of them is determined
by PGSIZE / blocksize.  Whenever a page in the page cache (meaning, in a
page_mapping) is up to date, it will have a BH.
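
A BH might look roughly like this (fields illustrative):

    struct buffer_head {
        struct page *bh_page;           /* the page holding this buffer */
        void *bh_buffer;                /* where in the page the data is */
        struct buffer_head *bh_next;    /* next BH on this page (the LL) */
        struct block_device *bh_bdev;   /* which device... */
        unsigned long bh_blocknum;      /* ...and which block it maps */
        unsigned int bh_flags;          /* dirty, up to date, etc. */
    };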

Another way would be to not have BHs at all, and just figure out (at
operation-time) what the n blocks on disk are for any given page, and submit
the IO operations for those blocks.  The BHs serve as a cache of that info.
They also serve as a location to store data about the mapping, such as whether
it is dirty or not, whether the data is up to date or not (has it been read
in), etc.  The caching ought to be rather beneficial, since it means we do not
need to access the disk-inode and indirect blocks whenever we want to perform
an operation (which may be frequent - imagine the periodic writeback of an
mmap'd file undergoing writes).  The per-buffer dirty tracking will also help:
no need to write unrelated blocks if only one is edited (though this will not
help with mmap(), since we don't usually know which part of the page is
written).

X.3: Do we always need BHs?
------------------
But what about those pages made up of contiguous blocks?  We don't have a
bunch of independent, non-sequential blocks that we need to map.  Do we need a
bunch of BHs for that?  Do we need any?  It really seems excessive to break it
up into separate buffers for no reason.  At the other extreme, we could get by
without having a BH at all, though this gets back to the other issue of
caching.  What we do (or will do) is have one BH for the entire page of
contiguous blocks.  If the page is a "buffer page," in Linux terms (meaning it
has separate buffers), it will have n BHs in a LL.  Either way, we'll always
have the mapping handy.  We wouldn't need to re-verify the contiguous nature of
the blocks anyways, since the fact that the page was up to date and didn't need
a BH would mean it was contiguous.  Further benefits to using the one BH
include: 1) we are likely to make BHs be the unit of IO *submission*, and having
one handy will simplify that code. 2) some code paths within the VFS may get BHs
as a return value, which they can then dirty.  Always having a BH makes this
easier (no need to find out if it's a buffer page, then decide how to dirty it).

Another complication with this is that certain code will want a block in the
middle of a page (very common for metadata).  That code will get the BH for the
entire page back.  It will need to determine the starting block and then the
offset of the block it wants.  Note, this usage of the BHs is finding the
buffer corresponding to a block number.  The BH is a bidirectional mapping
between buffers and blockdev:blocknum.  These calculations are a minor
complication, but easy enough to do, and will probably be worth it.  The
tradeoff is against having multiple BHs, which would mean multiple block IO
requests for writing a single page (and writing the whole page may be a common
operation).
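
The arithmetic itself is simple (a sketch, using the illustrative BH fields
from above):

    /* Find block 'blocknum' within a contiguous buffer page; the page's
     * single BH gives the first block on the page. */
    void *block_ptr(struct buffer_head *bh, unsigned long blocknum,
                    unsigned int blocksize)
    {
        return (char *)bh->bh_buffer +
               (blocknum - bh->bh_blocknum) * blocksize;
    }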

X.4: What about opening a blockdev as a file?
--------------------
Basically, don't do it for now.  The way we actually do things is that a buffer
cache page is "up to date", but has no BHs or data until a specific block is
requested.  This is because we don't always know if all the blocks mapped by a
page are actually metadata blocks.  If they aren't, we'll have issues where we
read in extra blocks that exist in both a file's page cache and the block
device's buffer cache.

A safe invariant is that a given block will only ever be in one cache: either a
file's page mapping or the buffer cache's page mapping.  When these blocks are
freed, we'll need to rip them out of their mapping (and possibly flush them).

There is one easy way to avoid this: don't open a bdev file if a file system is
mounted on top of it.  If you do, don't be surprised about inconsistencies.
Ideally, the FS will never leave the actual disk in an inconsistent state, but
the bdev's page cache could read things at different times and get some weird
results.  Just don't do this (for now - not like I plan to make this possible
any time soon).

Could we use the same buffers for both the blockdev-file page mapping and the
inode-file page mapping?  No - because the inode-file has the inode
indirection in between, which means a PGSIZE chunk of the file might not be
contiguous (as mentioned above).

We could have tried to avoid this bdev file problem by having the page mapping
radix trees point to a set of BHs that describes that page worth of buffers,
so that the bdev and an inode pointing to the same data will use the same
buffers and BHs.  That won't work, since the file's buffers/blocks aren't in
the same order as they are on disk within a page.  Instead, we'd need to have
the page mapping go down to the FS's blocksize granularity, not to PGSIZE, so
that we could have independent leaves point to wherever they want.  This would
push the specific block device's blocksize into the VFS (which I don't like).

But the blocksize-infecting-the-VFS alone isn't too bad.  The real issue is
what is at the end of the page mapping.  Is it a page or a buffer/BH?  We want
pages for two related reasons: higher levels of the kernel's IO systems deal
with pages:
1. mmap().  Good old mmap() requires at least a page of contiguous block
buffers, so that the physical page frame can be mapped directly into a
process's address space.
2. Fast IO.  We want to use page remapping for fast IO.  This means that IO
has to be in PGSIZE chunks - not blocksize chunks.

X.5: Managing Dirty Buffers
--------------------
Many (some/all?) of the block functions will return a BH to describe the
mapping.  One reason for doing this (mentioned above) is to allow the caller
to manage whether a buffer is dirty or not.  The tracking of dirty pages is
done by the page cache.  The dirtying is done by the caller; it can simply mark
the BH dirty and forget about it.  The writeback is handled usually by the page
cache or is triggered by an FS event.  Eventually, we'll have a kernel thread
that periodically writes dirty buffers back to disk.  Processes can also do
this by calling fsync.  FSs themselves will trigger syncs of metadata.  This
will come from having dirty SBs and inodes in the VFS.
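
From the caller's side, that pattern is just (a sketch; BH_DIRTY is an
illustrative flag name):

    /* Edit a cached metadata buffer, mark it dirty, and move on; the
     * page cache or a later sync handles the writeback. */
    void update_metadata_block(struct buffer_head *bh, void *src, size_t len)
    {
        memcpy(bh->bh_buffer, src, len);
        bh->bh_flags |= BH_DIRTY;
    }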

Note, the issue of whether or not we pin metadata blocks, such as the inode's
indirect blocks (or other related blocks), in the page cache is independent of
all these issues.  If they are not cached / pinned, we would just have to
block and reread the inode's structures back into memory and proceed (for
instance, we'd do this when identifying blocks for a page mapping on a file
read).  The reason we wouldn't want to pin them is to save memory.

X.6: Reference Counting BHs and Pages
--------------------
There are a lot of complications with this, and if there is a good reason,
we'll change this later.

X.6.1: Basics
-----------
So we talk about passing around BHs, both when submitting IO ops and when
returning from things like get_buffer().  However, currently, we do not kref
or reference count BHs in any way.  Instead, we kref pages.  We do this (for
now) for a couple reasons:
1) Pages are the unit of memory management in the kernel.  Higher levels of
the kernel will pin/incref the page (such as in an mmap()).
2) BHs hang off of a page, but exist only to be expressive about the
management of the buffers on the page.  It's not like how a file hangs off a
dentry, where the dentry doesn't know (or care) about the file.
3) We already refcount pages.  While we could do the same for the BHs, it is a
bit redundant.  Any place that would be kreffing the BH can just kref the
whole page.  Code that receives a BH as a return value is actually getting a
page under the covers (though should use put_buffer() to drop its reference).

X.6.2: How we refcnt pages in a page mapping
-----------
When a page is in the page cache, we give it one kref.  Whenever a function or
subsystem is using one of these pages (IO, using the data, etc), there needs
to be a kref.  When it is time to remove the page from the page mapping, the
code needs to remove it from the radix tree, then kref_put it.  When the final
user is done with it (ideally, there is none, but we need to handle the case)
the release function will clean up the page - including freeing its BHs.
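
A sketch of that removal (names illustrative):

    /* Evict a page from its page mapping. */
    void pm_remove_page(struct page_map *pm, struct page *page)
    {
        spin_lock(&pm->pm_tree_lock);
        radix_delete(&pm->pm_tree, page->pg_index);
        spin_unlock(&pm->pm_tree_lock);
        /* drop the page cache's stored ref; the release function (which
         * also frees the BHs) runs once the last user kref_put()s */
        kref_put(&page->pg_kref);
    }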

This does suck in that we could be removing an item from the page cache while
it is being used, violating some LRU policy.  The actual decision to remove it
should use those policies (when possible); the kreffing is simply to avoid
issues from race conditions.  (A reader starts using a page right before it is
ripped from the mapping).

X.6.3: More issues with Evictions
-----------
One issue with this is that dirty pages/buffers will need to be written back.
If someone tries to read while the page is removed from the page_mapping, but
before it is written back, they could get an old version if the read happens
before the write.  This is only an issue if the page is dirty.  One option
would be to writeback the page/buffer, then later remove it from the page
cache when it is read.  There are issues with concurrent writers, though if
that happens, we probably don't really want to remove it (it was LRU).  Note
this is an issue regardless of whether or not BHs are refcounted.  Also, this
is not an issue for when we kick a dentry/inode out of the cache - there should
be no one else trying to use it (since their refcnt was 0).  This is just a
concern when the system runs low on memory and wants to reclaim memory
potentially held by caches.

Also note that writeback of pages will happen regardless of eviction plans
(fsync, every n sec, etc).

This refcounting pattern is very similar to unlinking, where you can continue
to access a file once it is removed from the directory.  The difference here
is that future requests will get the same object (the page), unlike in the FS,
where you get ENOENT.  The page mapping is a cache, and you need to get the
same old data that was in there before the eviction.

A final issue is when the VFS aggressively pins blockdev (metadata) buffers.
Ideally, we'd like to be able to expel pages/buffers even if they are
refcnt'd.  The subsystems will always want to keep stuff in RAM.  This
also/especially applies to mmap().  One solution would be to keep them in RAM,
but have the BH keep track of who is holding its reference.  Then we could
unmap the page, which would need to get read back in on its next access.  We'd
need (or ought to have) some sort of callbacks.  This will get solved later
when we deal with unmapping mmap'd files, since it is the same problem - just
with different code and actors.

X.6.4: What about buffers inside pages?
-----------
For a while, I thought about refcounting BHs/buffers.  The issue that drives
it is the buffer cache (block dev page mapping holding metadata blocks of a
FS).  We had been operating on the cache in page-sized chunks, which
erroneously was reading in blocks adjacent to metadata blocks.  This would
have been an issue when we write back pages that have dirty blocks; blocks
were erroneously in the metadata cache and would overwrite potentially
file-related blocks that were in an incoherent cache (the file/inode's page
mapping).

We broke the bdev's buffer cache up into smaller BHs, so that only metadata
blocks get read in, but we eventually will have to get rid of a metadata block
(and not the entire page from the cache) (ex: an inode is removed from the
disk - its indirect blocks need to go, and they could be next to anything on
disk).  The desire for the BH refcnt came from wanting to rip the BHs out of
the list when it was time to evict them from the cache (in case they became
file blocks later).  It isn't clear what the best way to do this is.  Probably
we'd have all users refcnt the BHs, which refcnt the pages.  However, some
users (like mmap and the file page cache) operate in page chunks - so this
would require them to incref and decref the BHs.  Triggering the page reclaim
might also be tricky.  One option would be to just rip a page from the cache,
callback to whoever has it loaded so they fault it back in later, and
sleep-block everyone who interferes with the operation.  Yikes.

Another option was to forget about the BH <-> page crap for the buffer cache,
and just have a standalone buffer cache with BHs as its unit of operation
(instead of a page), and replicate the algorithms/code for the buffer cache.
There is still a notion of BHs in a page, and page_release() / page_free()
would probably have to be a little different since its page isn't really a
PG_BUFFER (but it really is).

Instead, here's what we do: we refcnt at page granularity, since all users can
be considered users of a page.  While it's not fine-grained, it does represent
the idea that someone doesn't want the page to be freed.  It's similar to
having a dentry be the refcnt source when someone really wants the inode/file.
When we want to remove a page from the block buffer cache, all we do is mark
it as not dirty and not up-to-date.  Now, whenever the page gets evicted from
the cache, we only write back those block buffers that are dirty, so the
recently evicted block will not be written back.  If we attempt to read the
block in the future (perhaps it is reallocated as a metablock), then the BH is
still there for the mapping, but is simply not up to date.  The code already
knows how to handle this (since it could happen during a race condition), and
it will simply read the buffer back in.  This new reading is necessary, since
it is possible that the block was used for file IO in between the uses of it
as a metablock.
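
That "removal" is just flag manipulation (a sketch, with illustrative flag
names):

    /* Drop a block from the block buffer cache without ripping out the
     * page or its BH. */
    void bdev_evict_block(struct buffer_head *bh)
    {
        bh->bh_flags &= ~BH_DIRTY;      /* won't get written back */
        bh->bh_flags &= ~BH_UPTODATE;   /* next use must re-read it */
    }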

If more than one thread is operating on a block buffer in the page
cache, then at the level of the cache, there is a race.  One could be marking
it as not dirty while another is dirtying it, etc.  However, no one should be
removing a buffer (aka, deallocating a block) while it is in use.  This sort
of concurrency problem should be sorted higher in the software stack (like at
the inode).

On a similar note, no one should ever remove a block's buffer from the
metadata buffer cache if it is dirty.  When those removals happen, it means
the block should be dealloced on the block device - meaning no one cares what
happens to it.  It's not meant to have data preserved.

X.7: What about Directories?  Inodes or metadata?
--------------------
Directories have inodes that have blocks scattered around the disk.  The blocks
making up the body of a directory are not sequential, like when you read blocks
from a bdev.  If you wanted a "page" of a directory, then you'd need to use the
i_mapping (page cache of a file) to access it, like any other file.

However, we don't want a page of a directory - at least not yet.  The main
reason we can get away with this is due to the lack of a mmap() or a desire to
page-remap for directory-contents IO.  It's all for the kernel's internal use.
At no point does anyone call generic_file_read() or _write() on it.

That being said, we treat the blocks of a directory as metadata blocks.  We do
figure out which blocks they are by walking the inode (also made of metadata
blocks), but don't bother to set up a page map for the directory itself.  We
just use the inode to figure out which metadata blocks we want, then read them
in (out of the blockdev's page cache).

Note that userspace issues reads on the directory.  This is mostly a convenience
thing (or an inconvenience thing), which just ends up being a wrapper around
readdir() (via generic_dir_read()).