On Thu, Apr 26, 2007 at 05:48:12PM +1000, Nick Piggin wrote:
> Christoph Lameter wrote:
> > On Thu, 26 Apr 2007, Nick Piggin wrote:
> > > No I don't want to add another fs layer.
> >
> > Well maybe you could explain what you want. Preferably without
> > redefining the established terms?
>
> Support for larger buffers than page cache pages.
The problem with this approach is that it turns around the whole way we
look at bufferheads. Right now we have a well-defined 1:n mapping of
page to bufferheads, so we typically lock the page first and then
iterate all the bufferheads on the page.

Going the other way, we need to support m:n, which means the buffer has
to become the primary interface for the filesystem to the page cache.
i.e. we need to lock the bufferhead first, then iterate all the pages
on it. This is messy because the cache indexes via pages, not
bufferheads, hence a buffer needs to point to all the pages in it
explicitly, and this leads to interesting issues with locking.

If you still think that this is a good idea, I suggest that you spend a
bit of time looking at fs/xfs/linux-2.6/xfs_buf.c, because that is
*exactly* what it does - it is a multi-page buffer interface on top of
a block device address space radix tree. This cache is the reason that
XFS was so easy to transition to large block sizes (I only needed to
convert the data path). However, this approach has some serious
problems:

	- need to index buffers so that lookups can be done on the
	  buffer before the page
	- completely different locking is required
	- needs memory allocation to hold more than 4 pages
	- needs vmap() rather than kmap_atomic() for mapping multi-page
	  buffers
	- I/O needs to be issued based on buffers, not pages
	- needs its own flush code
	- does not interface with memory reclaim well

IOWs, we need to turn every filesystem completely upside down to make
it work with this sort of large page infrastructure, not to mention the
rest of the VM (mmap, page reclaim, etc). It's back to the bad ol' days
of buffer caches again and we don't want to go back there.

Compared to a buffer based implementation, the high order page cache is
a picture of elegance and refined integration. It is an evolutionary
step, not a disconnect, from what we have now....

> > Because 4k is a good page size that is bound to the binary format?
> > Frankly there is no point in having my text files in large page
> > sizes. However, when I read a dvd then I may want to transfer 64k
> > chunks, or when I use my flash drive I may want to transfer 128k
> > chunks. And yes, if a scientific application needs to do a data dump
> > then it should be able to use very high page sizes (megabytes,
> > gigabytes) to be able to continue its work while the huge dump runs
> > at full I/O speed ...
>
> So block size > page cache size... also, you should obviously be using
> hardware that is tuned to work well with 4K pages, because surely
> there is lots of that around.

The CPU hardware works well with 4k pages, but in general I/O hardware
works more efficiently as the number of s/g entries it requires drops
for a given I/O size. Given that we limit drivers to 128 s/g entries,
we really aren't using I/O hardware to its full potential or at its
most efficient by limiting each s/g entry to a single 4k page.

And FWIW, having a buffer for block size > page size does not solve
this problem - only contiguous page allocation solves this problem.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
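[Editorial aside, not part of the original mail: a minimal userspace
sketch of the scatter/gather arithmetic Dave refers to above. Assuming
the 128 s/g entry driver limit he quotes and that each entry maps one
physically contiguous chunk, the maximum size of a single I/O request
scales directly with the size of each contiguous chunk.]

	/*
	 * Illustrative sketch: upper bound on a single I/O request for a
	 * driver limited to 128 s/g entries, where each entry maps one
	 * physically contiguous chunk. 128 x 4k pages = 512k per I/O,
	 * whereas 128 x 64k contiguous chunks = 8M per I/O.
	 */
	#include <stdio.h>

	int main(void)
	{
		unsigned long sg_entries = 128;	/* typical driver s/g limit */
		unsigned long chunk_sizes[] = { 4096, 65536 };
		int i;

		for (i = 0; i < 2; i++)
			printf("%lu s/g entries x %luk chunks = %luk max per I/O\n",
			       sg_entries, chunk_sizes[i] / 1024,
			       sg_entries * chunk_sizes[i] / 1024);
		return 0;
	}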