Unorganized Notes on the Virtual File System
-------------------------------------
Our VFS is built similarly to Linux's, mostly because I'd like to have somewhat
easy integration with ext2 (or at the very least have our own FS that can
integrate into Linux easily).

There are four main objects in a filesystem: superblock, inode, dentry, and a
file (each is sketched in code below).
- Superblock: Specific instance of a mounted filesystem. All
  synchronization is done with the one spinlock.
- Inode: represents a specific file.
- Dentry: in-memory object, corresponding to an element of a path. E.g. /,
  usr, bin, and vim are all dentries. All have inodes. Vim happens to be a
  file instead of a directory.
- File: represents a file opened by a process.
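
As a rough sketch only (these field names are illustrative assumptions, not our
actual headers), the four objects relate roughly like this:

/* Hedged sketch: field names are assumptions; types like spinlock_t, kref, and
 * qstr are presumed to come from our kernel headers. */
struct super_block {
	spinlock_t s_lock;          /* the one lock for SB synchronization */
	struct dentry *s_root;      /* dentry of the root of this mounted FS */
};

struct inode {
	unsigned long i_ino;
	struct super_block *i_sb;
	struct kref i_kref;
	unsigned int i_nlink;       /* on-disk hard link count */
};

struct dentry {
	struct dentry *d_parent;    /* chains a path element back toward / */
	struct inode *d_inode;      /* the file this path element names */
	struct qstr d_name;
	bool d_mount_point;         /* TRUE if an FS is mounted on top of this */
	struct kref d_kref;
};

struct file {
	struct dentry *f_dentry;    /* the name this file was opened through */
	off_t f_pos;
};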

So far, dentries are the most complicated, so here are some notes about them.
These are probably self-consistent, but may be a bit old (or just wrong) because
I wrote them as I put the VFS together:

A dentry is just a connection between a part of a path and an inode. We just
want to walk from dentry to dentry, asking each to find the next part in the
path. When you hit a dentry that's not the end of the desired path, you ask its
inode for the next dentry by using its operation (lookup).

lookup() takes the inode of the directory (or 0 for the root) and a dentry
(already allocated, with the d_name and i_name filled in), and it will find the
correct inode for the given dentry, as well as filling out any specific FS
things in the dentry (like d_ops). It will return 0 on failure, or the dentry
pointer passed in on success. Somehow the nameidata bit will be used too. This
will probably change a bit... Note that lookup() needs to read the actual
directory file and also look up the inode off the disc, which means it will
block.
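
As a hedged sketch of that walk (helper names like get_dentry() and the exact
i_op->lookup() signature are assumptions based on the description above):

/* Sketch: ask the current dir's inode for the next path element. */
struct dentry *walk_one(struct dentry *dir, char *name)
{
	/* dentry is pre-allocated with d_name filled in; lookup() finds its inode,
	 * fills in FS-specific bits (d_ops), and returns 0 on failure. */
	struct dentry *next = get_dentry(dir->d_inode->i_sb, dir, name);
	if (!dir->d_inode->i_op->lookup(dir->d_inode, next, 0 /* nameidata, TBD */)) {
		kref_put(&next->d_kref);
		return 0;
	}
	return next;    /* lookup() may block: it reads the dir file and the inode */
}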

When the next dentry is a mountpoint (e.g. /mnt/cdrom), and you ask mnt for
cdrom, lookup() will parse its file (a directory) and see 'cdrom' there as a child
entry. Then it will return cdrom's dentry, like always.

But since cdrom was a mountpoint (which you can see from the d_mount_point),
you need to walk through the structures a bit to get the dentry of the FS rooted
at that mountpoint to keep walking. The VFS can handle that, so lookup()
doesn't need to worry about it.

Why are there two dentries associated with a vfsmount? Normally, a dentry for a
directory points to an inode that contains its members. When a FS is mounted
over it, we don't want to destroy that. Instead, we mark it as a mountpoint,
and when following, we walk into the next FS. The mountpoint is the dentry in
the parent mount (using the parent mount's FS superblock), and the root is the
dentry of the mounted FS, using the FS superblock of the mounted FS.
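
A hedged sketch of those two dentries (field names mirror Linux's vfsmount;
ours may differ, and lookup_child_mount() is a hypothetical helper):

struct vfsmount {
	struct vfsmount *mnt_parent;
	struct dentry *mnt_mountpoint; /* dentry in the parent mount, d_mount_point set */
	struct dentry *mnt_root;       /* root dentry of the mounted FS */
	struct super_block *mnt_sb;    /* SB of the mounted FS */
};

/* Sketch: when a walk hits a mountpoint, jump to the mounted FS's root. */
struct dentry *follow_mount(struct dentry *dentry, struct vfsmount **mnt)
{
	if (!dentry->d_mount_point)
		return dentry;
	*mnt = lookup_child_mount(*mnt, dentry);  /* find the child vfsmount */
	return (*mnt)->mnt_root;                  /* keep walking from its root */
}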

Sometimes lookup() will not be called at all. We want to look for present
dentries first (ones that exist in memory already, including all of those with a
refcnt). So we can look in the cache (currently just an SLIST, but will
eventually be a hash table). When hitting in the cache (hashed by dentry name,
and probably the parent dir), we need to verify that the dentry's parent is our
own (handled by the VFS code). All vfsmount dentries will be in the cache,
since they all have a refcnt (the vfsmount struct points to them).

Dentries for / and pwd and whatnot have a refcnt (pointed to by fs_struct,
vfsmounts, etc). Anything with a pointer to it has a refcnt, so it never goes
away. So if we have an inode in memory, its entire dentry path is cached
(just follow the parent pointers back). Note that when a dentry's refcnt hits
0, we do *not* deallocate (like we do with many other structures). It will just
be on the LRU list in the dcache system. Likewise, every dentry points to its
inode, which pins that inode in memory.

Other refcnts: just about everything has a refcnt, and we need to be careful
about when we want to use them and dealloc. Some things would be a pain in the
ass, like with the super_block refcnt. Every dentry has a ref to the SB, but
changing that ref every time we add or remove a dentry will probably be an
unnecessary penalty if we can ensure all dentries of that FS are gone before
removing the superblock through another mechanism. We'll see. Mostly, this
just means we need to think about what we really want from a refcnt, and whether
or not we want the kref / process style refcnting.

Mounting:
-------------------------------------------
When you mount, you need to read in the super block and connect the relevant
data structures together. The SB is connected to the vfsmount, which is
connected to the dentry of the mount point and the dentry of the root of the FS.
This means when mounting the FS, we need to create the dentry for "/", which
means we also need the inode, which needs to be read_inode()'d in. Actually, we
might not need to read the inode in right away - we might be able to get away
with reading them in on-demand.

All of this means that for every mount point, the SB, vfsmount, dentry, and
inode are in memory. Due to the way dentries link, every dentry and inode back
up to the real root are all in memory too. Having a mount point is like having
a process working in that directory - the chain back up is pinned.
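
A hedged sketch of that wiring (get_sb() and the field names are assumptions;
error handling omitted):

/* Sketch: connect the SB, vfsmount, mountpoint dentry, and the FS's root. */
struct vfsmount *mount_fs(struct fs_type *fs, struct block_device *bdev,
                          struct vfsmount *parent, struct dentry *mnt_pt)
{
	struct vfsmount *mnt = kmalloc(sizeof(struct vfsmount), 0);
	struct super_block *sb = fs->get_sb(fs, bdev);  /* reads the SB, creates the
	                                                 * "/" dentry + inode */
	mnt->mnt_sb = sb;
	mnt->mnt_parent = parent;
	mnt->mnt_mountpoint = mnt_pt;   /* dentry in the parent mount */
	mnt->mnt_root = sb->s_root;     /* root dentry of the new FS */
	mnt_pt->d_mount_point = TRUE;   /* both dentries are now pinned by the vfsmount */
	return mnt;
}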

d_subdirs:
-------------------------------------------
Tracking the links between objects can be tricky. One pain is d_subdirs. Linux
only tracks subdirectories. We also do this. I think the reason they do it is
since you should be using the dcache to cache lookups, and not scan the linked
list of children of a dentry for a specific file. Though it is handy to know
all of your *directory* children. In KFS, we also track all children in a list.
This is to make our lookups work - instead of having an actual directory file
with name->ino mappings.

KFS and metadata pinning:
-------------------------------------------
KFS pins all metadata ("all paths pinned"). This means that from the root mnt
down to the lowest inode, all dentries and corresponding inodes are pinned in
memory from creation time. Yeah, this means we have chunks of metadata for
files we aren't using sitting around in RAM. We also have the *files* sitting
around in RAM too. Not that concerned, for now. Plus, I don't want to reparse
the CPIO backing store to figure out inode fields, directory names, etc.

Page Cache:
-------------------------------------------
Every object that has pages, like an inode or the swap (or even direct block
devices) has a page_map tracking which of its pages are currently in memory.
This is called a struct address_space in Linux, which is a confusing name. We
don't have all of the same fields yet, and may be doing things slightly
differently, but the basics are the same. Together, the page_maps and the
functions to manipulate them make up the Page Cache. Every page frame that is
in a page mapping can be traced back to its page_map, via pointers in the struct
page. Note the page_mapping is tracked twice for a file, the f_mapping and the
i_mapping. We really only need the i_mapping, but this saves a couple cache
misses. Might go away later.
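
A hedged sketch of the shape of a page_map (the radix tree and these field names
are assumptions about the layout, not the exact code):

/* Sketch: a page_map tracks which pages of an object are resident in RAM. */
struct page_map {
	struct radix_tree pm_tree;      /* page index -> struct page* */
	struct inode *pm_host;          /* object whose pages we are caching */
	unsigned long pm_num_pages;
	spinlock_t pm_lock;
};

/* Every page frame in a mapping can be traced back to its page_map. */
struct page {
	struct page_map *pg_mapping;
	unsigned long pg_index;         /* which PGSIZE chunk of the object this is */
	unsigned long pg_flags;         /* PG_LOCKED, PG_UPTODATE, PG_DIRTY, ... */
	struct kref pg_kref;
};

A file's f_mapping would simply point at its inode's i_mapping, which is where
the saved cache misses come from.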

As a side note, Linux's "address_space" has a great example of the power of
their linked lists. Their struct has a private_list head. Due to their list
implementation, they are able to have a generic list head for a list of any
type (it's a struct list_head), and don't need to declare in a generic object
(like the page_map) a specific list type (that would get used by specific
FS's).

Just because a page is in a page_map, it doesn't mean it actually has the
data from the disc in it. It just means that there is a physical frame
dedicated/mapped to be a page_map's holder for a specific page of an object
(usually a file on disc). readpage() is called to fill that page in with what
it is supposed to have from its backing store.

This interpretation makes the meaning of "address space" make more sense. It's
the "address space" of the device, mapping from (file,index) -> page_frame.
Still, calling it an address space just confuses things with the virtual memory
address space.

One of the reasons pages can be in the map without up-to-date data is due to
concurrency issues and outstanding I/O requests. When the page is first being
filled up, the mapping exists but the data hasn't been brought in yet. Other
processes can be trying to access the same block/page/byte; they need to block,
but should not try to schedule the operation again themselves.

So here's how a typical page fault (__handle_page_fault(), either on demand or
populated) works on an mmap'd file at a high level.
1. PF is on a virt address -> translated by the vm_region to a file/offset (page).
1b. Weird permission issues? See below!
2. Look it up in the page_map.
3. If the page is already there, and up-to-date, then great. Skip to 6. If
there is one, but it's not up to date, skip to 5.
4. If there is no page, get a free page, tie it (bidirectionally) to the inode
in the page_map.
5. Now there is a page, but it is not up to date, so call readpage(). This will
usually block.
6. Map that page (which has the current contents) into the address space of the
calling process (with the appropriate permissions: RO (MAP_PRIVATE, CoW) or
RW (MAP_SHARED)).
Below: Now if in step 1 you had a permissions issue, such as a RW fault on a CoW
MAP_PRIVATE, you'll have to notice the type of error and the type of memory
region, then go through a separate path: get a new page, copy the contents, and
change the mapping. Note, the page backing that mapping is not backed by the
file - it's just sitting around in the virtual memory of the process.
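
A condensed, hedged sketch of those steps (pm_find_page(), pm_insert_page(), and
kpage_alloc() are hypothetical helper names):

/* Sketch of steps 2-6 for an mmap'd file fault (no permission issues). */
int fault_in_file_page(struct page_map *pm, unsigned long index, struct page **pp)
{
	struct page *page = pm_find_page(pm, index);        /* step 2 */
	if (!page) {                                         /* step 4 */
		page = kpage_alloc();
		pm_insert_page(pm, index, page);                 /* ties page <-> page_map */
	}
	if (!(page->pg_flags & PG_UPTODATE)) {               /* steps 3/5 */
		int ret = readpage(pm, page);                    /* FS hook; usually blocks */
		if (ret)
			return ret;
	}
	*pp = page;     /* step 6: caller maps it RO (MAP_PRIVATE) or RW (MAP_SHARED) */
	return 0;
}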

Also, if we want to use the PG_DIRTY flag, we'll need to mark the regions as RO
until we write fault, at which point we dirty the page and change it to RW.

We could have multiple threads trying to fill a page in the page cache at once.
This is handled in file_load_page(). All threads check the page cache. If two
threads try to add it to the page cache, only one will succeed, and the page
will be locked (PG_LOCKED). The one who succeeds will readpage(). The one that
didn't will be like any other thread that is checking the page cache - it will
see a page is there, and will check if the page is up to date. If it isn't, it
will try to lock the page so it can do the IO, with a few extra checks in case
the page had been removed or was filled in while it slept.

A big aspect of this is the use of lock_page() to block. If the page is locked,
you block until it is unlocked. (Implementation and other issues still
pending.) Even the caller of readpage will lock after submitting the IO
request. This will cause the caller to sleep until the IO is done. When the IO
is done, that interrupt handler/thread will mark the page as up-to-date, and
unlock the page, which will wake up any of the waiters. The caller of
readpage() may not be the first to wake up either. This all means the IO system
needs to know about marking pages as up-to-date and unlocking them. This
currently (Jul10) is just sitting in KFS, but can be done later either directly
or with a callback made by whoever starts the IO.
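
A hedged sketch of that pattern (the waitqueue details are hand-waved; flag and
helper names are assumptions):

/* Sketch: callers block on a locked page; IO completion marks it and wakes them. */
void lock_page(struct page *page)
{
	/* A real version would sleep on a per-page waitqueue instead of rescheduling. */
	while (test_and_set_bit(PG_LOCKED, &page->pg_flags))
		schedule();
}

void unlock_page(struct page *page)
{
	clear_bit(PG_LOCKED, &page->pg_flags);
	wake_page_waiters(page);        /* hypothetical: wake anyone in lock_page() */
}

/* The IO completion path (interrupt handler or per-device thread): */
void readpage_done(struct page *page)
{
	page->pg_flags |= PG_UPTODATE;  /* mark up-to-date before waking waiters */
	unlock_page(page);              /* the readpage() caller may not wake first */
}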

A note on refcnting. When a page is added to the page cache, that's a stored
reference. When you lookup a page in the page cache, you get a refcnt'd
reference back. When you pull a page from the page cache, you also get a
refcnt'd reference back - specifically it is the ref that was in the page map.

Files with Holes
--------------------------
If a file has a hole, we'll still try to look up the page in the page cache.
When that lookup fails, we'll create and add a page, then call readpage().
Readpage will realize there is no page on disk/backing store for that part of
the file (since it was a hole) and just memset the page to 0. In the future, we
can consider getting a CoW 0-page, but that's a bit premature and a bookkeeping
pain.

This also applies to trying to write to a block beyond the EOF. If the request
hits the page cache and readpage(), it's because it was already checked and
cleared in another part of the VFS, such as in generic_file_write().

Files That End Before a Page Boundary
--------------------------
So what if we have a file that is only 1024 bytes? When we read it in, we'll
have a whole page added to the page cache, with the extra 3K 0'd. When we go to
write it back, it will write back the 1K, and ignore the rest. But what if we
extend the file later? That page is already in the page cache, but the buffer
heads aren't tracking the new 3K. When that page gets evicted, only the 1K will
write back.

There are two ways to deal with this: 1) When we extend the file, check and see
if it is in the middle of a page, and if so, alloc the blocks (all FS
specific!), and adjust the BHs to map the new data. 2) Preallocate the
remaining blocks and stitch up the BH mapping at readpage (which is in the
specific FS). This way, we reserve the blocks (but don't have to actually read
them in - we just mark the buffer as dirty). When we grow a file, we don't need
to worry about any of these details; we just make sure the page map has the
page, and don't care whether or not it's a half-page that needs a FS-specific
solution. I'm going with #2 for now. Note that while the blocks are allocated,
the file's size is still 1024B.

Kref, Dentries, Inodes, and Files (or "I don't see why it's like X, but...")
--------------------------
There are multiple dentries pointing to an inode. The dentries are (or will be)
cached too, but that is irrelevant. The dentries pin the inodes in memory.
However, files pin inodes in memory (or they did, briefly). After running around
in circles a bit, I asked, why doesn't the file pin the dentry instead of the
inode? The answer: it is supposed to. Linux does that, and I didn't because
pinning the inode made more sense at the time.

The heart of the issue is about understanding what files are and how they
relate to the rest of the VFS. A 'file' in the OS is a structure to track an
open FS-disk-file, which is managed by the inode. Given this, it makes sense
that a dentry (which is a name on a path) would be pinned by the corresponding
inode, and the file would pin the inode. It doesn't, but it is believable. In
reality, there are many names (dentries) for a given disk file, and the OS file
that you open corresponds to *one* of those names, and thus a dentry, and not to
the inode/specific file. You need to go through the dentry to pin the inode.

In short, it's not: file -> inode -> dentry -> parent_dentry -> ...
It's:               file -> dentry -> parent_dentry -> ...
                              |-> inode    |-> parent_inode
Another dentry and file (both OS and disk) can point to the same inode. If you
don't do it this way, you can't pin up past the multi-dentry-point in the inode,
and your system doesn't really make sense.

So here is how it works: files pin dentries. Dentries can pin other dentries,
on up the directory hierarchy. Independently of the files, dentries pin their
inode. There are many dentries per inode (or can be). Since each dentry
doesn't know if it is the last dentry to decref the inode, we use a kref on
i_kref. The inodes are storing references to the dentries, but they are the
kref "internal" / weak references. Even if we did decref them, we don't trigger
anything with it.
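
A hedged sketch of that pinning chain on open/close (helper and field names are
assumptions; error handling omitted):

/* Sketch: the open file pins the dentry; the dentry holds the ref on the inode. */
struct file *do_open(struct dentry *dentry)
{
	struct file *file = kmalloc(sizeof(struct file), 0);
	kref_get(&dentry->d_kref, 1);   /* file -> dentry (dentry already holds i_kref) */
	file->f_dentry = dentry;
	return file;
}

void do_close(struct file *file)
{
	/* Dropping the dentry may in turn drop the last i_kref on its inode;
	 * parent dentries stay pinned only while children or users hold them. */
	kref_put(&file->f_dentry->d_kref);
	kfree(file);
}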

The moral of the story is that if you don't fully understand something, you are
not in as good of a position to recommend changes or criticize as if you did
your homework. Not that you can't, just that you should 'do your homework.'

Musings on path_lookup()
--------------------------
Things can get tricky with path lookup, especially with ., .., and symlinks.
When doing a LOOKUP_PARENT on a . or .., we give the parent of whatever the path
would have resolved to. So /dir1/dir2/dir3/.'s parent is dir2.
/dir1/dir2/dir3/..'s parent is dir1. I don't think Linux does this (note the
parent lookup is for internal kernel stuff, like when you want to edit
metadata). When you try to make a . or .. file, you should get some sort of
error anyways. We'll see how this works out.

Symlinks can be a bit tricky. We handle ours a bit differently too, especially
regarding PARENT lookups. Ultimately, you can do the same things in ROS that
you can do in Linux - if you try to create a file that is a dangling symlink,
you'll correctly create the destination file. We handle this in
link_path_walk(). It will return the PARENT of whatever you would resolve to -
instead of trying to handle this in do_file_open() (which I think Linux does).

Also, our handling of symlinks differs a bit from Linux. Eventually, it has
become clear we're going to need to manually port ext2, and we do some things
differently in our core VFS already. Might as well do more things differently -
like getting rid of follow_link and put_link from the FS-specific sections. Our
FSs just need to know how to return a char* for a symname - and not do any of
the actual link following. Or any of the other stuff they do. We'll see if
that turns out to be an issue or not...

Unlinking and other Link Stuff
-------------------------
Unlinking is just disconnecting a dentry-inode pair from the directory tree, and
decreasing the inode's i_nlink. Nothing else happens yet, since we need to keep
the FS-file (controlled by the dentry/inode) so long as any OS-files have it
open. They have it opened via open or mmap - any way that there is a reference
to a file, which then pins the dentry and inode. When the OS-files close,
eventually the dentry's refcnt hits 0. When it does, it normally would be up
for caching, but we can check nlinks and just drop it. When that happens, it
releases the inode, which will see its nlinks is 0. That will trigger the
underlying FS to clear out the FS-file.
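
A hedged sketch of that flow (the dir op and dcache helper names are assumptions):

/* Sketch: unlink disconnects the name and drops the on-disk link count. */
int do_unlink(struct inode *dir, struct dentry *dentry)
{
	int ret = dir->i_op->unlink(dir, dentry);  /* FS removes the name from the dir */
	if (ret)
		return ret;
	dentry->d_inode->i_nlink--;
	dcache_remove(dentry);         /* don't serve it to future lookups */
	kref_put(&dentry->d_kref);     /* when the last OS-file closes and the refcnt
	                                * hits 0, inode_release() sees nlink == 0 and
	                                * has the FS clear out the FS-file */
	return 0;
}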

For directories, you can only have one hardlink to a directory - meaning you are
only in the directory tree in one place. However, all of your children can get
to you by going '../'. We'll count these as hardlinks too. This means that
every child increments its parent-dir's nlink. These are the on-disk links, not
to be confused with the dentry->d_parent and kref() business that goes on for
the in-memory objects. A directory cannot be removed if nlinks > 1. If it is
1, then you can rmdir it, which will set its nlinks to 0. Then its inode's
storage space will get freed when it is deleted, like any other inode. In
theory.

Inodes: drop? delete? dealloc?
--------------------------
Inodes exist both in memory and on disk, but in different manners. When a file
(F_WHATEVER, could be DIR) exists in an FS, it'll have an inode on disk. When
it is time to delete that file, we call _delete_inode(). When we want to free
the memory associated with an in-memory (VFS) inode, we call _dealloc_inode().

What about drop_inode? For now, we don't use it. We have inode_release() in
the VFS. If we need an FS-specific one (like for ext2), or have FS-specific
work that needs to be done in inode_release(), we'll use it later.

Either way, inode_release() is called when we no longer use the in-memory inode.
If there are no hard links, it will delete the inode. In both cases, it will
free the in-memory inode (after deleting the disc version, when applicable).

Example: I want to unlink (rm) a file. There are two cases: the file is already
open (with a dentry and the inode in memory) or the file is not. Both cases are
handled the same way! In either case, we eventually call do_lookup on the item
in question, getting both a dentry and its inode read in. (We read it in for a
couple reasons: convenient to check the type, and we need to manipulate the
nlink.) If a process has the file open, or even if it is sitting in the cache,
we will get the same inode (from the inode cache, might not be implemented yet).
When we decref the dentry and it hits 0, it will decref the inode. This
dentry's final decref will be deferred until any open files are closed. Note,
this requires a working dentry/inode-cache - otherwise we'll have multiple
copies of the same FS/disk-inode (and possibly dentry). Anyway, when this is
done, the release function will delete the inode, then dealloc it.

Another example: We simply close a file. When that happens, we decref the
dentry, which decrefs the inode. It may remain cached for a bit - not a big
deal. When it is finally removed, nlinks is positive, so the inode's in-memory
copy is written back (if it was dirty) and the structure is deallocated.

Side notes: dentry cached inodes should be removed after their lookup in unlink.
Also, since multiple dentries point to the same inode, it's not enough to just
cache dentries - we need to be able to find inodes too so that we get the one
inode regardless of which dentry we use (which may be uncached).

Dentry and Inode Caches
--------------------------
The dentry cache caches dentry lookups - we need the parent and a hash of the
name to do a lookup. The dcache consists of a hash table for the lookups, as
well as an extra list of entries that are unused (their kref is 0). The dcache
also caches negative entries - entries for names that turned out not to exist.
This still speeds up future requests. Most uses of the system just need to use
dcache_get and dcache_put. Not all of this is implemented yet.

The inode cache is similar, though we can't just have the inodes hang off the
dentry cache (more than one dentry points to the same inode). We don't need to
worry about unused lists or anything like that - once the kref hits 0, we're
done and we can rip it out of the hash.

Both hashes hang off the superblock, with concurrent access protected by locks
in the SB.

The dentry cache is the weirdest of them all - for normal entries, its key and
value are the same thing. The actual hashing of a dentry is done by the qstr
value, and to determine equality, we need to compare parents (compared to the
inode cache, where the only thing that matters is the i_ino). Put another way,
we need elements of the whole dentry to get a unique key (d_parent and d_name).
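
A hedged sketch of that keying (hash_string() and hash_ptr() are hypothetical
helpers):

/* Sketch: the dcache key is (d_parent, d_name); the dentry is both key and value. */
unsigned long dcache_hash(struct dentry *query)
{
	return hash_string(query->d_name.name) ^ hash_ptr(query->d_parent);
}

bool dcache_eq(struct dentry *a, struct dentry *b)
{
	/* matching names are not enough; the parents must be the same dentry */
	return a->d_parent == b->d_parent && !strcmp(a->d_name.name, b->d_name.name);
}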

As stated above, the dcache also caches negative entries. This is to prevent a
lookup on disk. These negative entries are in the dcache and on the LRU list
(their refcnt is 0, they are not USED). When we dcache_get, we don't bother with
returning the actual dentry (after increffing) and then decreffing it again.
Instead, we just return the negative result (via the query dentry,
incidentally).

Freeing of dentries happens in one of two ways: call __dentry_free() directly,
which is appropriate when you have the only copy (like in do_lookup()), or it
will get freed when the dcache gets reaped (the LRU entries are freed). When it
is decref'd, it simply goes into a state where it is ready to be reaped, but
kept around for future lookups - most usages throughout the vfs can just decref
when they are done using it.

One complication is cached negative dentries. These are only referenced once
(in the dcache), so they can get __dentry_free()d directly. This gets tricky
with rmdir and unlink. Initially, those functions marked the dentry as negative
and unused, and would let them stay in the dcache (returning negative results on
future lookups). The problem with this is that now the dcache could have a
negative dentry that was a real, formerly used dentry - one with a refcnt that
needs to be decref'd and released.

There are two solutions: one is to change the dcache to not assume that negative
entries are unreferenced (which also means on the LRU). The other is to just
remove the dentry from the dcache on rmdir/unlink. It won't be negative - and
that won't matter, since it is un-lookup-able. And it will die nicely when it
gets decref'd. All we'll do is add a DENTRY_DYING flag, and dentry_release()
will avoid LRU and unusing it. The dcache can continue to assume that negative
entries are unused/LRU/dentry_freeable/ref==0, and not worry about calling
kref_put().

X: Buffer Cache, Page Cache, Buffer Heads, WTH?
-------------------------------
X.1: Page vs Buffer Cache
--------------------
So we discussed the page cache, but as described, it does not satisfy all of
our disk caching needs. Traditionally, there would also be a 'buffer cache.'
Buffers usually refer to memory used to hold data from the disk (or network).
I can think of a couple different ways someone could think of the buffer
cache, and to understand them, we first need to understand what we need to be
caching.

There are two main types of FS data: file data and metadata. This metadata
is FS-specific data accessed by the VFS to get to a file's contents. For
example, the superblock, inodes, inode indirect blocks, and to a certain
extent the directory's contents are all metadata. There isn't an FS file (not
to be confused with an OS file) that corresponds to this data, but it needs to
be read in and cached. Furthermore, this cache needs to be managed and
written back when dirty. Note that file data can be broken up into two
different types: read()/write() data and mmap'd data. When people talk about
buffer caches versus page caches, they are talking about these two different
types of file data (at least NetBSD did:
http://www.usenix.org/event/usenix2000/freenix/full_papers/silvers/silvers_html/).

A very simple buffer cache would include anything *ever* read from disk. It
would then get copied into the page cache in PGSIZE chunks for the page cache.
This would suck, since we now have two or three copies. We obviously want to
use the page cache for both mmap() and read/write(). It's not clear about the
metadata.

Another usage of a buffer cache would be to cache only the disk blocks needed
for metadata. I think this is what Linux did before it unified its buffer and
page caches (implied in UTLK). The main issue with this is that you have two
different systems for managing essentially similar data - we only want to deal
with flushing, syncing, and writebacks in one subsystem, not in multiple
different subsystems.

The solution is to use the page cache to cache the metadata blocks' buffers.
We do this by having the block device be 'mapped' (but not read) in PGSIZE
chunks through its own struct page_mapping (a.k.a. struct address_space in
Linux). This way, both files and the block device are mapped in PGSIZE chunks
via the same page_mapping structure, and will be managed by the same code.

Sort of. We don't actually read in PGSIZE chunks for the block buffer cache.
If blocks will be in the bdev's cache, then they will be adjacent and on the
same page. It is possible some adjacent blocks (which would be on the same
page) are not both metadata.

A more accurate way to describe what we do is that metadata blocks are copied
into a 'buffer cache' that is mapped and managed similarly to the page cache.
Pages are made of buffer heads, which hold data, and the reclaiming of pages of
memory from either the page cache or the buffer cache will use the same code -
since they are both just made of buffer pages.

X.2: Mapping Blockdev Data vs File Data
--------------------
An important distinction between file data (as managed by an inode) and the
bdev is that PGSIZE chunks of data for the bdev *must* be made of contiguous
disk blocks. Ideally, file data is also contiguous/sequential, but that is
not always the case - hence the need for the inode's block pointers. This
means that the chunk at the bottom of the page_mapping radix tree is a page,
and on that page there may be several buffers, holding the data of
nonsequential disk blocks - but not all pages are like this. The bdev
pages are made up of sequential blocks due to the very nature of what we are
mapping; there's no inode abstraction in between.

There are a couple ways we could handle this. We adopt the Linux approach of
using something called a buffer head (BH), which describes the mapping from
in-memory buffer to block device / block number. These are slab-allocated,
and exist for each buffer of a page. The page itself points to the first of
its BHs, all of which exist in a LL. The maximum number of them is determined
by PGSIZE / blocksize. Whenever there is an up-to-date page in the page cache
(meaning, in a page_mapping), it will have a BH.
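
A hedged sketch of a BH, following the description above (the exact fields are
assumptions):

/* Sketch: a buffer head maps an in-memory buffer to blockdev:blocknum. */
struct buffer_head {
	struct page *bh_page;           /* page this buffer lives on */
	void *bh_buffer;                /* address of this buffer within the page */
	struct block_device *bh_bdev;   /* the bidirectional mapping: buffer <-> ... */
	unsigned long bh_blocknum;      /* ... blockdev:blocknum */
	unsigned int bh_nr_blocks;      /* contiguous blocks covered by this BH */
	unsigned long bh_flags;         /* BH_DIRTY, BH_UPTODATE, ... */
	struct buffer_head *bh_next;    /* LL of BHs on this page (<= PGSIZE/blocksize) */
};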

Another way would be to not have BHs at all, and just figure out (at
operation-time) what the n blocks on disk are for any given page, and submit
the IO operations for those blocks. The BHs serve as a cache of that info.
They also serve as a location to store data about the mapping, such as whether
it is dirty or not, whether the data is up to date or not (has it been read
in), etc. The caching ought to be rather beneficial, since it means we do not
need to access the disk-inode and indirect blocks whenever we want to perform
an operation (which may be frequent - imagine the periodic writeback of an
mmap'd file undergoing writes). The per-buffer dirty tracking will also help:
no need to write unrelated blocks if only one is edited (though this will not
help with mmap(), since we don't usually know which part of the page is
written).

X.3: Do we always need BHs?
------------------
But what about those pages made up of contiguous blocks? We don't have a
bunch of independent, non-sequential blocks that we need to map. Do we need a
bunch of BHs for that? Do we need any? It really seems excessive to break it
up into separate buffers for no reason. At the other extreme, we could get by
without having a BH at all, though this gets back to the other issue of
caching. What we do (or will do) is have one BH for the entire page of
contiguous blocks. If the page is a "buffer page," in Linux terms (meaning it
has separate buffers), it will have n BHs in a LL. Either way, we'll always
have the mapping handy. We wouldn't need to re-verify the contiguous nature of
the blocks anyways, since the fact that the page was up to date and didn't need
a BH would mean it was contiguous. Further benefits to using the one BH
include: 1) we are likely to make BHs be the unit of IO *submission*, and having
one handy will simplify that code. 2) some code paths within the VFS may get BHs
as a return value, which they can then dirty. Always having a BH makes this
easier (no need to find out if it's a buffer page, then decide how to dirty it).

Another complication with this is certain code will want a block in the middle
of a page (very common for metadata). That code will get the BH for the
entire page back. It will need to determine the starting block and then the
offset of the block it wants. Note, this usage of the BHs is finding the
buffer corresponding to a block number. The BH is a bidirectional mapping
between buffers and blockdev:blocknum. These calculations are a minor
complication, but easy enough to do, and will probably be worth it. The
tradeoff is against having multiple BHs, which would mean multiple block IO
requests for writing a single page (and writing the whole page may be a common
operation).
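
A hedged sketch of that arithmetic (assuming the single-BH-per-page case from
above):

/* Sketch: find the buffer for 'blocknum' within a page covered by one BH. */
void *block_buffer(struct buffer_head *bh, unsigned long blocknum,
                   unsigned int blocksize)
{
	/* bh_blocknum is the first (starting) block mapped by this page's BH */
	return (char*)bh->bh_buffer + (blocknum - bh->bh_blocknum) * blocksize;
}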

X.4: What about opening a blockdev as a file?
--------------------
Basically, don't do it for now. The way we actually do things is a buffer cache
page is "up to date", but has no BHs or data until a specific block is
requested. This is because we don't always know if all the blocks mapped by a
page are actually metadata blocks. If they aren't, we'll have issues where we
read in extra blocks that exist in both a file's page cache and the block
device's buffer cache.

A safe invariant is that a given block will only ever be in one cache: either a
file's page mapping or the buffer cache's page mapping. When these blocks are
freed, we'll need to rip them out of their mapping (and possibly flush them).

There is one easy way to avoid this: don't open a bdev file if a file system is
mounted on top of it. If you do, don't be surprised about inconsistencies.
Ideally, the FS will never leave the actual disk in an inconsistent state, but
the bdev's page cache could read things at different times and get some weird
results. Just don't do this (for now - not like I plan to make this possible
any time soon).

Could we use the same buffers for both the blockdev-file page mapping and the
inode-file page mapping? No - because the inode-file has the inode
indirection in between, which means a PGSIZE chunk of the file might not be
contiguous (as mentioned above).

We could have tried to avoid this bdev file problem by having the page mapping
radix trees point to a set of BHs that describes that page worth of buffers,
so that the bdev and an inode pointing to the same data will use the same
buffers and BHs. That won't work, since the file's buffers/blocks aren't in
the same order as they are on disk within a page. Instead, we'd need to have
the page mapping go down to the FS's blocksize granularity, not to PGSIZE, so
that we could have independent leaves point to wherever they want. This would
push the specific block device's blocksize into the VFS (which I don't like).

But the blocksize-infecting-the-VFS alone isn't too bad. The real issue is
what is at the end of the page mapping. Is it a page or a buffer/BH? We want
pages for two related reasons: higher levels of the kernel's IO systems deal
with pages:
1. mmap(). Good old mmap() requires at least a page of contiguous block
buffers, so that the physical page frame can be mapped directly into a
process's address space.
2. Fast IO. We want to use page remapping for fast IO. This means that IO
has to be in PGSIZE chunks - not blocksize chunks.

X.5: Managing Dirty Buffers
--------------------
Many (some/all?) of the block functions will return a BH to describe the
mapping. One reason for doing this (mentioned above) is to allow the caller
to manage if a buffer is dirty or not. The tracking of dirty pages is done by
the page cache. The dirtying is done by the caller; it can simply mark the BH
dirty and forget about it. The writeback is handled usually by the page cache
or is triggered by an FS event. Eventually, we'll have a kernel thread that
periodically writes dirty buffers back to disk. Processes can also do this by
calling fsync. FSs themselves will trigger syncs of metadata. This will come
from having dirty SBs and inodes in the VFS.
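
A hedged sketch of the two halves of that contract (flag, helper, and the
pg_first_bh field names are assumptions):

/* Sketch: the caller just dirties the BH; writeback happens elsewhere, later. */
void dirty_buffer(struct buffer_head *bh)
{
	bh->bh_flags |= BH_DIRTY;
	bh->bh_page->pg_flags |= PG_DIRTY;   /* the page cache tracks dirty pages */
}

/* The periodic writeback thread (or fsync) would then do roughly: */
void writeback_page(struct page *page)
{
	struct buffer_head *bh;
	for (bh = page->pg_first_bh; bh; bh = bh->bh_next)
		if (bh->bh_flags & BH_DIRTY) {
			submit_block_write(bh);      /* hypothetical per-BH IO submission */
			bh->bh_flags &= ~BH_DIRTY;   /* only dirty buffers get written */
		}
}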

Note, the issue of whether or not we pin metadata blocks, such as the inode's
indirect blocks (or other related blocks), in the page cache is independent of
all these issues. If they are not cached / pinned, we would just have to
block and reread the inode's structures back into memory and proceed (for
instance, we'd do this when identifying blocks for a page mapping on a file
read). The reason we wouldn't want to pin them is to save memory.

X.6: Reference Counting BHs and Pages
--------------------
There are a lot of complications with this, and if there is a good reason,
we'll change this later.

X.6.1: Basics
-----------
So we talk about passing around BHs, both when submitting IO ops and when
returning from things like get_buffer(). However, currently, we do not kref
or reference count BHs in any way. Instead, we kref pages. We do this (for
now) for a couple reasons:
1) Pages are the unit of memory management in the kernel. Higher levels of
the kernel will pin/incref the page (such as in an mmap()).
2) BHs hang off of a page, but exist only to be expressive about the
management of the buffers on the page. It's not like how a file hangs off a
dentry, where the dentry doesn't know (or care) about the file.
3) We already refcount pages. While we could do the same for the BHs, it is a
bit redundant. Any place that would be kreffing the BH can just kref the
whole page. Code that receives a BH as a return value is actually getting a
page under the covers (though should use put_buffer() to drop its reference).

X.6.2: How we refcnt pages in a page mapping
-----------
When a page is in the page cache, we give it one kref. Whenever a function or
subsystem is using one of these pages (IO, using the data, etc), there needs
to be a kref. When it is time to remove the page from the page mapping, the
code needs to remove it from the radix tree, then kref_put it. When the final
user is done with it (ideally, there is none, but we need to handle the case)
the release function will clean up the page - including freeing its BHs.
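
A hedged sketch of that refcounting (the radix and kref helper names are
assumptions):

/* Sketch: the mapping holds one stored kref; lookups hand out their own. */
struct page *pm_lookup(struct page_map *pm, unsigned long index)
{
	struct page *page;
	spin_lock(&pm->pm_lock);
	page = radix_lookup(&pm->pm_tree, index);
	if (page)
		kref_get(&page->pg_kref, 1);    /* the caller's reference */
	spin_unlock(&pm->pm_lock);
	return page;
}

void pm_remove_page(struct page_map *pm, struct page *page)
{
	spin_lock(&pm->pm_lock);
	radix_delete(&pm->pm_tree, page->pg_index);
	spin_unlock(&pm->pm_lock);
	kref_put(&page->pg_kref);   /* drop the stored ref; the final user's kref_put()
	                             * runs the release function, freeing the BHs */
}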

This does suck in that we could be removing an item from the page cache while
it is being used, violating some LRU policy. The actual decision to remove it
should use those policies (when possible); the kreffing is simply to avoid
issues from race conditions (a reader starts using a page right before it is
ripped from the mapping).

X.6.3: More issues with Evictions
-----------
One issue with this is that dirty pages/buffers will need to be written back.
If someone tries to read while the page is removed from the page_mapping, but
before it is written back, they could get an old version if the read happens
before the write. This is only an issue if the page is dirty. One option
would be to writeback the page/buffer, then later remove it from the page
cache when it is read. There are issues with concurrent writers, though if that
happens, we probably don't really want to remove it (it was LRU). Note this
is an issue regardless of whether or not BHs are refcounted. Also, this is
not an issue for when we kick a dentry/inode out of the cache - there should
be no one else trying to use it (since their refcnt was 0). This is just a
concern when the system runs low on memory and wants to reclaim memory
potentially held by caches.

Also note that writeback of pages will happen regardless of eviction plans
(fsync, every n sec, etc).

This refcounting pattern is very similar to unlinking, where you can continue
to access a file once it is removed from the directory. The difference here
is that future requests will get the same object (the page), unlike in the FS,
where you get ENOENT. The page mapping is a cache, and you need to get the
same old data that was in there before the eviction.

A final issue is when the VFS aggressively pins blockdev (metadata) buffers.
Ideally, we'd like to be able to expel pages/buffers even if they are
refcnt'd. The subsystems will always want to keep stuff in RAM. This
also/especially applies to mmap(). One solution would be to keep them in RAM,
but have the BH keep track of who is holding its reference. Then we could
unmap the page, which would need to get read back in on its next access. We'd
need (or ought to have) some sort of callbacks. This will get solved later
when we deal with unmapping mmap'd files, since it is the same problem - just
with different code and actors.

X.6.4: What about buffers inside pages?
-----------
For a while, I thought about refcounting BHs/buffers. The issue that drives
it is the buffer cache (block dev page mapping holding metadata blocks of a
FS). We had been operating on the cache in page-sized chunks, which
erroneously was reading in blocks adjacent to metadata blocks. This would
have been an issue when we write back pages that have dirty blocks; blocks
were erroneously in the metadata cache and would overwrite potentially
file-related blocks that were in an incoherent cache (the file/inode's page
mapping).

We broke the bdev's buffer cache up into smaller BHs, so that only metadata
blocks get read in, but we eventually will have to get rid of a metadata block
(and not the entire page from the cache) (ex: an inode is removed from the
disk - its indirect blocks need to go, and they could be next to anything on
disk). The desire for the BH refcnt came from wanting to rip the BHs out of
the list when it was time to evict them from the cache (in case they became
file blocks later). It isn't clear what the best way to do this is. Probably
we'd have all users refcnt the BHs, which refcnt the pages. However, some
users (like mmap and the file page cache) operate in page chunks - so this
would require them to incref and decref the BHs. Triggering the page reclaim
might also be tricky. One option would be to just rip a page from the cache,
callback to whoever has it loaded so they fault it back in later, and
sleep-block everyone who interferes with the operation. Yikes.

Another option was to forget about the BH <-> page crap for the buffer cache,
and just have a standalone buffer cache with BHs as its unit of operation
(instead of a page), and replicate the algorithms/code for the buffer cache.
There is still a notion of BHs in a page, and page_release() / page_free()
would probably have to be a little different since its page isn't really a
PG_BUFFER (but it really is).

Instead, here's what we do: we refcnt at page granularity, since all users can
be considered users of a page. While it's not fine-grained, it does represent
the idea that someone doesn't want the page to be freed. It's similar to
having a dentry be the refcnt source when someone really wants the inode/file.
When we want to remove a page from the block buffer cache, all we do is mark
it as not dirty and not up-to-date. Now, whenever the page gets evicted from
the cache, we only write back those block buffers that are dirty, so the
recently evicted block will not be written back. If we attempt to read the
block in the future (perhaps it is reallocated as a metablock), then the BH is
still there for the mapping, but is simply not up to date. The code already
knows how to handle this (since it could happen during a race condition), and
it will simply read the buffer back in. This new reading is necessary, since
it is possible that the block was used for file IO in between the uses of it
as a metablock.

If more than one thread is operating on a block buffer in the page
cache, then at the level of the cache, there is a race. One could be marking
it as not dirty while another is dirtying it, etc. However, no one should be
removing a buffer (aka, deallocating a block) while it is in use. This sort
of concurrency problem should be sorted higher in the software stack (like at
the inode).

On a similar note, no one should ever remove a block's buffer from the
metadata buffer cache if it is dirty. When those removals happen, it means
the block should be dealloced on the block device - meaning no one cares what
happens to it. It's not meant to have data preserved.

X.7: What about Directories? Inodes or metadata?
--------------------
Directories have inodes that have blocks scattered around the disk. The blocks
making up the body of a directory are not sequential, unlike when you read blocks
from a bdev. If you wanted a "page" of a directory, then you'd need to use the
i_mapping (page cache of a file) to access it, like any other file.

However, we don't want a page of a directory - at least not yet. The main
reason we can get away with this is due to the lack of a mmap() or a desire to
page-remap for directory-contents IO. It's all for the kernel's internal use.
At no point does anyone call generic_file_read() or _write() on it.

That being said, we treat the blocks of a directory as metadata blocks. We do
figure out which blocks they are by walking the inode (also made of metadata
blocks), but don't bother to set up a page map for the directory itself. We
just use the inode to figure out which metadata blocks we want, then read them
in (out of the blockdev's page cache).

Note that userspace issues reads on the directory. This is mostly a convenience
thing (or an inconvenience thing), which just ends up being a wrapper around
readdir() (via generic_dir_read()).