All,

I feel that if I am to advocate the design which Ogi has used to the core developers, I must demonstrate that it is the superior solution. This means critiquing the extant proposals. In this email, I take a look at how Roland's proposal works and suggest some potential shortcomings. In the near future, I will review Thomas' proposal. I hope that if I have misunderstood anything, Roland, or someone who understands his ideas on this topic better than I do, will correct me.
The most elaborate description of his approach (in fact, the only real description directly from him which I can find) is available at:

  http://lists.gnu.org/archive/html/hurd-devel/2002-03/msg00035.html

As I understand it, Roland suggests that we map all of the metadata (and, as much as possible, only the metadata) into the file system's address space at once. At the same time, each type of metadata is grouped together to provide more convenient access, thereby hiding the underlying layout of the file system from the front end. This requires that the pagers be more intelligent: rather than just worrying about chunks of data on the backing store, they now know about the contents of the data they manage. In this way, there would be a pager for the group descriptors, one for each of the block and inode bitmaps and a fourth for the inode table. These pagers do not directly access the backing store but instead use a large memory object, the disk pager, which is mapped only as needed (as opposed to now, where the entire disk pager is mapped into the address space).

Part of reorganizing the metadata and using this hierarchical paging system is that locating the blocks which compose a file becomes much simpler. Right now, the front end needs to look up a block and resolve the indirection itself. For instance, if we want to find a block in a given file, we use this algorithm:

  if (block < 12)
    return inode->direct_blocks[block];
  if (block < 12 + block_size / 4)
    return inode->indirect_block[block - 12];
  if (block < 12 + block_size / 4 + (block_size / 4) * (block_size / 4))
    return inode->double_indirect_block
             [(block - block_size / 4 - 12) / (block_size / 4)]
             [(block - block_size / 4 - 12) % (block_size / 4)];
  return inode->triple_indirect_block[...];

As you can see, there is a lot of indirection. With Roland's plan, we are able to express this succinctly as:

  inode->block[block]

inode->block is a memory mapped region backed by a pager which reorganizes this metadata into a simple array. Thus, when we look up block, say, 1200, assuming a 4k block size, we have double indirection. The block we are looking for is pointed to by:

  inode->double_indirect_block[(1200 - 4096 / 4 - 12) / (4096 / 4)]
                              [(1200 - 4096 / 4 - 12) % (4096 / 4)]

or:

  inode->double_indirect_block[0][164]

Under Roland's scheme, when we access inode->block[block] and it is not faulted in, the pager backing inode->block gets a pager_read_page request for the corresponding offset. The pager sees the offset, faults in the first page of the double indirect array and then the first page from that. Hence, we traverse three levels of the paging hierarchy. This all seems rather complicated; however, there is a large win here: as long as the block array is in core, subsequent accesses do not have to repeat this procedure (as they currently must). Right now, every time we access a block, we must traverse the entire hierarchy. This way, the kernel functionally keeps a cache of the calculation!
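Here is the classical lookup from above as a small, self-contained program (the names are mine, not ext2fs's); it spells out the same arithmetic and prints the indices for a given logical block:

  #include <stdio.h>

  #define DIRECT_BLOCKS 12

  static void
  locate (unsigned long block, unsigned long block_size)
  {
    unsigned long per_block = block_size / 4;   /* 4-byte block addresses */

    if (block < DIRECT_BLOCKS)
      printf ("direct_blocks[%lu]\n", block);
    else if (block < DIRECT_BLOCKS + per_block)
      printf ("indirect_block[%lu]\n", block - DIRECT_BLOCKS);
    else if (block < DIRECT_BLOCKS + per_block + per_block * per_block)
      {
        unsigned long i = block - DIRECT_BLOCKS - per_block;
        printf ("double_indirect_block[%lu][%lu]\n",
                i / per_block, i % per_block);
      }
    else
      printf ("triple_indirect_block[...]\n");
  }

  int
  main (void)
  {
    locate (1200, 4096);    /* prints: double_indirect_block[0][164] */
    return 0;
  }

Running it for block 1200 with a 4k block size prints double_indirect_block[0][164], matching the calculation above; under Roland's scheme, this walk happens once, in the pager, when the page is faulted in.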
On the surface, Roland's technique has two major failings. First, he assumes that all metadata can easily be mapped into the task's address space:

    A while back I did some off-the-cuff calculations of the ext2fs
    metadata.  I don't recall the numbers, but I satisfied myself that a
    workable amount of metadata (i.e. under a gigabyte or so) would
    cover up to terabytes of total filesystem data.

My calculations suggest that metadata accounts for, in the worst case scenario, a bit more than 1/14 of the disk.

Let's assume the worst case scenario, a full disk with the highest ratio of allocated indirect blocks to file data blocks, and ask: how large can the file system possibly be without exhausting virtual memory? An inode structure is sblock->s_inode_size bytes large; however, we can eliminate this abstraction, as the only supported size is 128 bytes. This structure lies in the inode table (metadata which we are easily able to account for up front). As well as containing the permissions and the mode, it addresses 12 direct blocks, 1 indirect, 1 double indirect and 1 triple indirect block. The highest indirect block to file block ratio comes from addressing just 1 file block with the indirect block: a ratio of 13 file blocks (the 12 direct plus the 1 addressed through the indirect block) for one indirect block. We can never use exactly 2 indirect blocks, so let's not even consider that case. We can use three: the 12 direct blocks are allocated, the indirect block is filled, and the double indirect block points to a single indirect block within it. In that case, we address 12 + block_size / 4 (a disk block address is fixed at 4 bytes) + 1 file blocks. block_size / 4 is either 256, 512 or 1024 (corresponding to the supported block sizes). In the worst case, using 1024 byte disk blocks, we have 12 + 256 + 1 file blocks per 3 indirect blocks. The file block to indirect block ratio only grows as more indirect blocks are used, hence we have found the worst metadata block to file data block ratio. Let's assume all files use a single indirect block (i.e. each file contains 13 blocks and therefore uses 1 indirect block).

We always map the inode tables, the inode and block bitmaps and the group descriptors. mke2fs typically allocates one inode structure per two disk blocks, so our full disk will have a significant number of free inodes even though the disk blocks are all allocated. There will be total_blocks / 2 inode structures (total_inodes), which means total_inodes * 128 / block_size blocks allocated to the inode tables, plus total_inodes / (8 * block_size) blocks for the inode bitmaps and total_blocks / (8 * block_size) blocks for the block bitmaps (each bitmap uses one bit per object). We also need to add a dozen or so blocks for the super block and the group descriptors, but this is insignificant in comparison. Plugging in some numbers, we see that, in the worst case, a file system will be roughly 1/13 metadata, all of which we have to map under Roland's scheme. Assuming we have room for 2GB of metadata mappings, we could raise the limit to about 26GB.

But this is a fabricated case. Instead, let's consider a real file system, like my /home file system. Here is the pruned output of `tune2fs -l':

  Inode count:              15433728
  Block count:              30862873
  Free blocks:              16923206
  Free inodes:              15420825
  Block size:               4096
  Fragment size:            4096
  Blocks per group:         32768
  Fragments per group:      32768
  Inodes per group:         16384
  Inode blocks per group:   512
  Inode size:               128

This is a 120GB partition. As we can see, there are 15433728 inodes and each inode is 128 bytes. This yields a whopping 1.88GB of inode tables. Add to that the inode bitmaps and the block bitmaps (one block each per block group, roughly 4MB apiece) and we are at about 1.89GB. This is just the fixed metadata, assuming everything aligns perfectly. If we add to that the indirect blocks, we will quickly exceed the 2GB virtual memory limit.
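As a quick sanity check of these figures, here is a small standalone program (the constants are just the tune2fs values above; the assumption that each group spends one full block on each bitmap is mine):

  #include <stdio.h>

  int
  main (void)
  {
    /* Values from the tune2fs output above.  */
    unsigned long long inode_count = 15433728ULL;
    unsigned long long block_count = 30862873ULL;
    unsigned long long block_size = 4096ULL;
    unsigned long long inode_size = 128ULL;
    unsigned long long blocks_per_group = 32768ULL;

    unsigned long long groups
      = (block_count + blocks_per_group - 1) / blocks_per_group;

    unsigned long long inode_tables = inode_count * inode_size;
    /* One block per group for the block bitmap and one for the inode
       bitmap, whether or not every bit in them is used.  */
    unsigned long long bitmaps = 2 * groups * block_size;

    printf ("groups:       %llu\n", groups);
    printf ("inode tables: %llu MB\n", inode_tables >> 20);
    printf ("bitmaps:      %llu MB\n", bitmaps >> 20);
    printf ("total:        %llu MB\n", (inode_tables + bitmaps) >> 20);
    return 0;
  }

It reports 942 block groups and roughly 1890MB of fixed metadata, before a single indirect block is counted.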
The proposal's second problem is the organization of the metadata: when we look at the inode->block array above, exactly how are we to organize it? We know that the first 12 entries live directly in the inode, and we have to keep everything block aligned. The only solution that I can envision, while still maintaining the aesthetically pleasing array, is to allocate a block of memory, stuff the direct block addresses into its last 12 * 4 bytes and then have the rest follow naturally. This would add more logic to the pager, especially when flushing data to the backing store. However, there is still some hair to deal with when the block size is less than the page size: the indirect block will then coexist on the same page as the first 1 or 3 blocks of the double indirect level (depending on whether the block size is 2048 or 1024 bytes). This is manageable, but rather awkward.

A real solution would be to maintain some offsets as well as the array and use them in conjunction with the level of indirection. Thus getblk becomes:

  if (block < 12)
    return inode->block[inode->direct_offset + block];
  else if (block < 12 + block_size / 4)
    return inode->block[inode->indirect_offset + block - 12];
  etc.

This becomes really hairy when we consider block sizes less than vm_page_size: the paging hierarchy begins to work against us. If we could use store_read, we could take the blocks where (block % block_size != 1) and read them into memory at the right offset; as it is, however, we must maintain an array of offsets, since doing a gather is impossible without memcpy'ing the data.

Thanks,
Neal
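P.S. Here is a rough sketch of the arithmetic behind the first layout above (stuffing the direct addresses into the tail of the first block); the names are made up and this is not working pager code, just a check that the indices land on block boundaries:

  #include <assert.h>
  #include <stdint.h>

  #define DIRECT_BLOCKS 12

  /* The pager would build the memory object so that the last 12 * 4
     bytes of the first block hold the direct block addresses and the
     contents of the indirect blocks follow, block by block.  Entry 0 of
     the flattened array therefore sits 48 bytes before the end of the
     first block.  */
  static uint32_t *
  block_array (char *object, unsigned long block_size)
  {
    return (uint32_t *) (object + block_size
                         - DIRECT_BLOCKS * sizeof (uint32_t));
  }

  int
  main (void)
  {
    /* Stand-in for the memory object, three 4k blocks long.  */
    static uint32_t object[3 * 4096 / sizeof (uint32_t)];
    uint32_t *block = block_array ((char *) object, 4096);

    /* Entry 12 falls exactly on the start of the second block, i.e. the
       mapped indirect block; entry 12 + block_size / 4 on the start of
       the third.  */
    assert ((char *) &block[12] == (char *) object + 4096);
    assert ((char *) &block[12 + 4096 / 4] == (char *) object + 2 * 4096);
    return 0;
  }

With this layout, every group of block_size / 4 entries after the first 12 begins exactly on a block boundary, which is what lets the mapped metadata blocks "follow naturally" without an offset table.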