All,

I would like to take a moment to examine Thomas's proposal for fixing the 2GB ext2fs limitation and to highlight what I view as its potential shortcomings as well as its noteworthy parts. The only email I can find where he elaborates his plan in any detail is:
http://lists.gnu.org/archive/html/hurd-devel/2002-03/msg00035.html

Thomas, like Roland, would continue to use a single memory object (or, rather, an array of 4GB memory objects) to span the entire disk. However, rather than mapping the whole disk as we do now, we would only map what is required at any given time (and maintain some cache of mappings). Unlike Roland, Thomas does not have multiple metadata pagers. The metadata pager remains ignorant of its content, and no effort is made to build a simple array out of the inode table or the block list of a given file, etc. The result is that the file block lookup logic remains as it is (i.e. in the front end and not integrated into the pagers).

In order to access metadata, we continue to use accessor functions. Rather than applying a simple offset into a one-to-one mapping, a cache of active mappings is consulted. If a mapped region is found which contains the page, it is returned. Otherwise, a region containing the page is mapped into the address space, stored in the cache and returned. Regions may be as small as a page or as large as several megabytes; this parameter would need tuning based on empirical data.

As I explained in my mail reviewing Roland's proposal, metadata represents approximately 3% of the file system. As such, it is imperative that we provide a mechanism to drain the mapping cache and keep it at a reasonable size. (This is less important on small file systems, even though indirect blocks are part of the main block pool and, given sufficient time, all blocks could potentially have been used as indirect blocks. However, indirect blocks can be special cased: when they are freed, we remove any mapping straight away.)

At the very least, then, we need a list of the mappings. It is insufficient, however, to wait until vm_map fails to drain the cache, as the address space is shared with entities (i.e. anything which uses vm_allocate or vm_map) which are unaware of the cache and would cause the program to fail miserably on what in reality is a soft error. Unmapping cannot be haphazard either: we cannot release a mapping as long as there are users which expect the address to refer to specific disk blocks. Therefore, we need to reference count the regions, which means pairing each block accessor with a release function (sketched below). This system is clearly rather fragile, yet I see no simpler alternative. Happily, however, the locations where this must be done have been identified by Ogi, and that knowledge is easily transferable to any other system.

We must also consider what the eviction mechanism will look like. Thomas makes no suggestions of which I am aware. If we just evict mappings when the reference count drops to zero, the only mappings which will profit from the cache are pages being accessed concurrently. Although I have done no experiments to show how common concurrent access is, we need only consider a sequential read of a file to realize that it is often not the case: a client sends a request for X blocks to the file system; the server replies; and then, after the client processes the returned blocks, the client asks for more. Clearly, the inode will be consulted again. This scenario would have elided a vm_unmap and a vm_map had the mapping remained in the cache. Given this, I see a strong theoretical motivation to make cache entries more persistent.

If we make cache entries semi-persistent, a mechanism needs to be in place to drain the cache when it is full. The easiest place to trigger the draining is just before a new mapping is created.
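To make the accessor/release pairing and the drain trigger concrete, here is a minimal sketch in C. It is only an illustration of the idea under discussion, not Ogi's or Thomas's actual code; all of the names (struct mapping_region, region_lookup, region_map, drain_cache, block_size) are made up for the example.

  #include <assert.h>
  #include <stddef.h>
  #include <sys/types.h>

  /* Hypothetical cache entry: a window onto the disk's memory object,
     mapped into our address space.  */
  struct mapping_region
  {
    off_t start;        /* First disk block covered by the region.  */
    size_t blocks;      /* Number of blocks covered.  */
    char *addr;         /* Where the region is mapped.  */
    int refs;           /* Outstanding users of this mapping.  */
  };

  /* Hypothetical helpers: hash lookup, vm_map plus hash insertion, and
     the drain routine discussed above.  */
  extern size_t block_size;
  extern struct mapping_region *region_lookup (off_t block);
  extern struct mapping_region *region_map (off_t block);
  extern void drain_cache (void);

  /* Return a pointer to disk block BLOCK, mapping a region around it if
     it is not already mapped.  Must be paired with disk_block_deref.  */
  void *
  disk_block_ref (off_t block)
  {
    struct mapping_region *r = region_lookup (block);
    if (! r)
      {
        /* Drain before creating a new mapping; we cannot rely on vm_map
           failing to tell us that the cache has grown too large.  */
        drain_cache ();
        r = region_map (block);
      }
    r->refs++;
    return r->addr + (block - r->start) * block_size;
  }

  /* Release a reference obtained from disk_block_ref.  A region whose
     count reaches zero becomes an eviction candidate; it is not
     necessarily unmapped immediately.  */
  void
  disk_block_deref (off_t block)
  {
    struct mapping_region *r = region_lookup (block);
    assert (r && r->refs > 0);
    r->refs--;
  }

Every place which today computes a simple offset into the one-to-one mapping would instead call something like disk_block_ref, and call disk_block_deref once it is done with the block; these are exactly the locations Ogi has identified.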
But how do we find the best region to evict? The information we are able to account for is: when a given region was mapped into the address space, the last time the kernel requested pages from a region, and when the last references to a region were added. Except for the last item, this information is already in the kernel.

One way we could take advantage of this is to use the pager_notify_eviction mechanism which Ogi is using and which I described in a previous email [1]. If the kernel does not have a copy of a page (and there are no extant user references), then the page likely makes a good eviction candidate. This data could be augmented by the number of recent references, in conjunction with a standard clock aging algorithm. But really, that final bit is unnecessary: once the kernel has dropped a page, the only way we can get the data back is by reading it from disk, which makes an extra vm_unmap and vm_map rather cheap by comparison. Strictly following this offers another advantage: the cached data in the file system remains proportional to the amount of data cached in the kernel. This, it seems to me, is a good argument for keeping the region size equal to vm_page_size, as I have in my proposal. (A rough sketch of this follows after the references.)

So far, Thomas's proposal is strikingly similar to what I have suggested [2,3] (or rather, my proposal is strikingly similar to his). The major difference lies in what we are caching: Thomas has a large memory object of which small parts are mapped into the address space; I have a single small memory object of which the entire contents are mapped into the address space. Thomas multiplexes the address space, keeping the contents of the memory object fixed; I multiplex the contents of the memory object, keeping the address space fixed. In my proposal, the mapping database lives only in the task and not also in Mach.

More concretely, we both require two hashes: Thomas hashes file system blocks to address space mappings and vice versa; I hash file system blocks to address space locations and vice versa. So, Thomas has lots of small mappings in the address space which are associated with disk blocks; I track the contents of a single large mapping. The advantage of my proposal, I believe, is that it is much easier on Mach (as far as I understand Mach's internals; I am sure that Thomas has much better insight into Mach's machinery, and I hope he will confirm this perceived advantage as either a fantasy or a reality): with only one mapping, Mach uses less memory. Since we both require the two hashes anyway, my mapping database consumes no additional memory.

Hopefully, I have given an accurate representation of Thomas's proposal. If I have gotten anything wrong, I hope you will point it out so that we can find the closest approximation of the ideal fix for this problem.

Thanks,
Neal

[1] http://lists.gnu.org/archive/html/bug-hurd/2004-08/msg00005.html
[2] http://lists.gnu.org/archive/html/bug-hurd/2002-12/msg00055.html
[3] http://lists.gnu.org/archive/html/bug-hurd/2003-05/msg00024.html
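As promised above, here is a rough sketch of how the eviction notification might drive the unmapping. It continues the hypothetical names of the earlier sketch; the callback's name comes from the mechanism Ogi is using, but its exact signature here is only illustrative and is not taken from his code.

  #include <stdlib.h>
  #include <sys/mman.h>

  /* Hypothetical: remove R from both hashes.  */
  extern void region_remove (struct mapping_region *r);

  /* Called when Mach evicts a page of the disk's memory object; PAGE is
     the offset of the page within that object, i.e. within the disk.  */
  void
  pager_notify_eviction (void *pager, off_t page)
  {
    struct mapping_region *r = region_lookup (page / block_size);
    if (r != NULL && r->refs == 0)
      {
        /* The kernel no longer caches this data; getting it back means a
           disk read anyway, so a later vm_map is comparatively cheap.
           Dropping the mapping now keeps our cache proportional to what
           the kernel itself caches.  */
        region_remove (r);
        munmap (r->addr, r->blocks * block_size);
        free (r);
      }
  }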