All,

I feel that if I am to advocate the design which Ogi has used to the core developers, I must demonstrate that it is the superior solution. This means critiquing the extant proposals. In this email, I take a look at how Roland's proposal works and suggest some potential shortcomings. In the near future, I will review Thomas' proposal. I hope that if I have misunderstood anything, Roland, or someone who understands his ideas on this topic better than I do, will correct me.
The most elaborate description of his approach (in fact, the only real description directly from him which I can find) is available at:

  http://lists.gnu.org/archive/html/hurd-devel/2002-03/msg00035.html

As I understand it, Roland suggests that we map all of the metadata (and, as much as possible, only the metadata) into the file system's address space at once. At the same time, each type of metadata is grouped together to provide more convenient access, thereby hiding the underlying layout of the file system from the front end. This requires that the pagers be more intelligent: rather than just worrying about chunks of data on the backing store, they now know about the contents of the data they manage. In this way, there would be a pager for the group descriptors, one for each of the block and inode bitmaps and a fourth for the inode table. These pagers do not directly access the backing store but instead use a large memory object, the disk pager, which is mapped only as needed (as opposed to now, where the entire disk pager is mapped into the address space).

Part of reorganizing the metadata and using this hierarchical paging system is that locating the blocks which compose a file becomes much simpler. Right now, the front end needs to look up a block and resolve the indirection itself. For instance, if we want to find a block in a given file, we use this algorithm:

  if (block < 12)
    return inode->direct_blocks[block];
  if (block < 12 + block_size / 4)
    return inode->indirect_block[block - 12];
  if (block < 12 + block_size / 4 + (block_size / 4) * (block_size / 4))
    return inode->double_indirect_block
             [(block - block_size / 4 - 12) / (block_size / 4)]
             [(block - block_size / 4 - 12) % (block_size / 4)];
  return inode->triple_indirect_block[...];

As you can see, there is a lot of indirection. With Roland's plan, we are able to express this succinctly as:

  inode->block[block]

inode->block is a memory mapped region backed by a pager which reorganizes this metadata into a simple array. Thus, when we look up block, say, 1200, assuming a 4k block size, we have double indirection. The block we are looking for is pointed to by:

  inode->double_indirect_block[(1200 - 4096 / 4 - 12) / (4096 / 4)]
                              [(1200 - 4096 / 4 - 12) % (4096 / 4)]

or:

  inode->double_indirect_block[0][164]

Under Roland's scheme, when we access inode->block[block] and it is not faulted in, the pager backing inode->block gets a pager_read_page request for the corresponding offset. The pager sees the offset, faults in the first page of the double indirect array and then the first page from that. Hence, we traverse three levels of the paging hierarchy. This all seems rather complicated; however, there is a large win here: as long as the block array is in core, subsequent accesses do not have to repeat this procedure (as they currently must). Right now, every time we access a block, we must traverse the entire hierarchy. This way, the kernel functionally keeps a cache of the calculation!
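Here is the classical lookup from above as a small, self-contained program (the names are mine, not ext2fs's); it spells out the same arithmetic and prints the indices for a given logical block:

  #include <stdio.h>

  #define DIRECT_BLOCKS 12

  static void
  locate (unsigned long block, unsigned long block_size)
  {
    unsigned long per_block = block_size / 4;   /* 4-byte block addresses */

    if (block < DIRECT_BLOCKS)
      printf ("direct_blocks[%lu]\n", block);
    else if (block < DIRECT_BLOCKS + per_block)
      printf ("indirect_block[%lu]\n", block - DIRECT_BLOCKS);
    else if (block < DIRECT_BLOCKS + per_block + per_block * per_block)
      {
        unsigned long i = block - DIRECT_BLOCKS - per_block;
        printf ("double_indirect_block[%lu][%lu]\n",
                i / per_block, i % per_block);
      }
    else
      printf ("triple_indirect_block[...]\n");
  }

  int
  main (void)
  {
    locate (1200, 4096);    /* prints: double_indirect_block[0][164] */
    return 0;
  }

Running it for block 1200 with a 4k block size prints double_indirect_block[0][164], matching the calculation above; under Roland's scheme, this walk happens once, in the pager, when the page is faulted in.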
On the surface, Roland's technique has two major failings. First, he assumes that all metadata can easily be mapped into the task's address space:

    A while back I did some off-the-cuff calculations of the ext2fs
    metadata.  I don't recall the numbers, but I satisfied myself that a
    workable amount of metadata (i.e. under a gigabyte or so) would
    cover up to terabytes of total filesystem data.

My calculations suggest that metadata accounts for, in the worst case scenario, a bit more than 1/14 of the disk.

Let's assume the worst case scenario, a full disk with the highest ratio of allocated indirect blocks to file data blocks, and ask: how large can the file system possibly be without exhausting virtual memory? An inode structure is sblock->s_inode_size bytes large; however, we can eliminate this abstraction, as the only supported size is 128 bytes. This structure lies in the inode table (metadata which we are easily able to account for up front). As well as containing the permissions and the mode, it addresses 12 direct blocks, 1 indirect, 1 double indirect and 1 triple indirect block. The highest indirect block to file block ratio comes from addressing just 1 file block with the indirect block: a ratio of 13 file blocks (the 12 direct plus the 1 addressed through the indirect block) for one indirect block. We can never use exactly 2 indirect blocks, so let's not even consider that case. We can use three: the 12 direct blocks are allocated, the indirect block is filled, and the double indirect block points to a single indirect block within it. In that case, we address 12 + block_size / 4 (a disk block address is fixed at 4 bytes) + 1 file blocks. block_size / 4 is either 256, 512 or 1024 (corresponding to the supported block sizes). In the worst case, using 1024 byte disk blocks, we have 12 + 256 + 1 file blocks per 3 indirect blocks. The file block to indirect block ratio only grows as more indirect blocks are used, hence we have found the worst metadata block to file data block ratio. Let's assume all files use a single indirect block (i.e. each file contains 13 blocks and therefore uses 1 indirect block).

We always map the inode tables, the inode and block bitmaps and the group descriptors. mke2fs typically allocates one inode structure per two disk blocks, so our full disk will have a significant number of free inodes even though the disk blocks are all allocated. There will be total_blocks / 2 inode structures (total_inodes), which means total_inodes * 128 / block_size blocks allocated to the inode tables, plus total_inodes / (8 * block_size) blocks for the inode bitmaps and total_blocks / (8 * block_size) blocks for the block bitmaps (each bitmap uses one bit per object). We also need to add a dozen or so blocks for the super block and the group descriptors, but this is insignificant in comparison. Plugging in some numbers, we see that, in the worst case, a file system will be roughly 1/13 metadata, all of which we have to map under Roland's scheme. Assuming we have room for 2GB of metadata mappings, we could raise the limit to about 26GB.

But this is a fabricated case. Instead, let's consider a real file system, like my /home file system. Here is the pruned output of `tune2fs -l':

  Inode count:              15433728
  Block count:              30862873
  Free blocks:              16923206
  Free inodes:              15420825
  Block size:               4096
  Fragment size:            4096
  Blocks per group:         32768
  Fragments per group:      32768
  Inodes per group:         16384
  Inode blocks per group:   512
  Inode size:               128

This is a 120GB partition. As we can see, there are 15433728 inodes and each inode is 128 bytes. This yields a whopping 1.88GB of inode tables. Add to that the inode bitmaps and the block bitmaps (one block each per block group, roughly 4MB apiece) and we are at about 1.89GB. This is just the fixed metadata, assuming everything aligns perfectly. If we add to that the indirect blocks, we will quickly exceed the 2GB virtual memory limit.
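As a quick sanity check of these figures, here is a small standalone program (the constants are just the tune2fs values above; the assumption that each group spends one full block on each bitmap is mine):

  #include <stdio.h>

  int
  main (void)
  {
    /* Values from the tune2fs output above.  */
    unsigned long long inode_count = 15433728ULL;
    unsigned long long block_count = 30862873ULL;
    unsigned long long block_size = 4096ULL;
    unsigned long long inode_size = 128ULL;
    unsigned long long blocks_per_group = 32768ULL;

    unsigned long long groups
      = (block_count + blocks_per_group - 1) / blocks_per_group;

    unsigned long long inode_tables = inode_count * inode_size;
    /* One block per group for the block bitmap and one for the inode
       bitmap, whether or not every bit in them is used.  */
    unsigned long long bitmaps = 2 * groups * block_size;

    printf ("groups:       %llu\n", groups);
    printf ("inode tables: %llu MB\n", inode_tables >> 20);
    printf ("bitmaps:      %llu MB\n", bitmaps >> 20);
    printf ("total:        %llu MB\n", (inode_tables + bitmaps) >> 20);
    return 0;
  }

It reports 942 block groups and roughly 1890MB of fixed metadata, before a single indirect block is counted.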
The proposal's second problem is the organization of the metadata: when we look at the inode->block array above, exactly how are we to organize it? We know that the first 12 entries live directly in the inode, and we have to keep everything block aligned. The only solution that I can envision, while still maintaining the aesthetically pleasing array, is to allocate a block of memory, stuff the direct block addresses into its last 12 * 4 bytes and then have the rest follow naturally. This would add more logic to the pager, especially when flushing data to the backing store. However, there is still some hair to deal with when the block size is less than the page size: the indirect block will then coexist on the same page as the first 1 or 3 blocks of the double indirect level (depending on whether the block size is 2048 or 1024 bytes). This is manageable, but rather awkward.

A real solution would be to maintain some offsets as well as the array and use them in conjunction with the level of indirection. Thus getblk becomes:

  if (block < 12)
    return inode->block[inode->direct_offset + block];
  else if (block < 12 + block_size / 4)
    return inode->block[inode->indirect_offset + block - 12];
  etc.

This becomes really hairy when we consider block sizes less than vm_page_size: the paging hierarchy begins to work against us. If we could use store_read, we could take the blocks where (block % block_size != 1) and read them into memory at the right offset; as it is, however, we must maintain an array of offsets, since doing a gather is impossible without memcpy'ing the data.

Thanks,
Neal
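P.S. Here is a rough sketch of the arithmetic behind the first layout above (stuffing the direct addresses into the tail of the first block); the names are made up and this is not working pager code, just a check that the indices land on block boundaries:

  #include <assert.h>
  #include <stdint.h>

  #define DIRECT_BLOCKS 12

  /* The pager would build the memory object so that the last 12 * 4
     bytes of the first block hold the direct block addresses and the
     contents of the indirect blocks follow, block by block.  Entry 0 of
     the flattened array therefore sits 48 bytes before the end of the
     first block.  */
  static uint32_t *
  block_array (char *object, unsigned long block_size)
  {
    return (uint32_t *) (object + block_size
                         - DIRECT_BLOCKS * sizeof (uint32_t));
  }

  int
  main (void)
  {
    /* Stand-in for the memory object, three 4k blocks long.  */
    static uint32_t object[3 * 4096 / sizeof (uint32_t)];
    uint32_t *block = block_array ((char *) object, 4096);

    /* Entry 12 falls exactly on the start of the second block, i.e. the
       mapped indirect block; entry 12 + block_size / 4 on the start of
       the third.  */
    assert ((char *) &block[12] == (char *) object + 4096);
    assert ((char *) &block[12 + 4096 / 4] == (char *) object + 2 * 4096);
    return 0;
  }

With this layout, every group of block_size / 4 entries after the first 12 begins exactly on a block boundary, which is what lets the mapped metadata blocks "follow naturally" without an offset table.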