On Tue, Dec 21, 2010 at 7:58 AM, Jeff Bacon <ba...@walleyesoftware.com> wrote:
> One thing I've been confused about for a long time is the relationship
> between ZFS, the ARC, and the page cache.
>
> We have an application that's a quasi-database. It reads files by
> mmap()ing them. (Writes are done via write().) We're talking 100TB of
> data in files that are 100k->50G in size (the files have headers to tell
> the app what segment to map, so mapped chunks are in the 100k->50M
> range, though sometimes it's sequential.)
>
> I found it confusing that we ended up having to allocate a ton of swap
> to back anon pages behind all the mmap()ing. We never write to an
> mmap()ed space, so we don't ever write to swap, so it's not a huge deal,
> but it's curious.

Since others have already commented on the rest of this, I will note that
you can use the MAP_NORESERVE flag with mmap to prevent that behavior
(which, if the mmap'ed data isn't being altered, shouldn't cause any
issues).

> In the old days of UFS, there was the page cache. You mmap()ed a file,
> it was allocated a range in your VM space, and the pager paged in files on
> demand via VFS/UFS. UFS had a block cache, but it was only about big
> enough to deal with queueing; the page cache was the main cache.
>
> ZFS seems to break this model - the paging system and page cache are
> still there, but then there's this ARC layer (and L2ARC layer)
> underneath them. If I read the concept right, it seems to work better in
> a world where users read()/write(), and all the caching is done within
> the ARC while the page cache exists primarily for process heap, so the
> ARC expands and contracts as necessary to stay out of the way of process
> heap requirements but otherwise caching happens in ARC space.
>
> If I'm following this, what we're doing is essentially duplicating -
> files exist in the page cache, but then they also exist in the ARC, and
> since the ARC is in kernel space, I presume that the VM subsystem
> doesn't know that a page that happens to be in the page cache is
> actually in the ARC as well.
> As a result of this line of thinking, I've tuned the box such that the
> ARC is relatively small (10G out of 96), and is only caching metadata,
> with piles of L2ARC behind it, assuming that the page cache page is the
> one I need, letting the pager deal with what to keep in and out of RAM,
> and leaning on the I/O subsystem to make up for it.
>
> (This sounds less terrible than you think - the machine has 90 dual-port
> SAS-2 spindles across 6 LSI controllers with 12 x4 uplinks off the
> expanders, no daisy-chain, with OCZ Vertex2Pro L2ARCs. I can push
> 5GByte/sec on/off disk all day without sweating hard.)
>
> Is my line of thinking valid, or am I missing something?
>
> Thanks,
> -bacon
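For reference, the metadata-only ARC tuning described above would look roughly like the following on Solaris-derived systems. This is a sketch, not a recommendation; the pool name "tank" is an assumption, and the values are simply the ones quoted above.

```shell
# Keep only metadata in the ARC for this pool; file data is then served
# from the page cache (mmap) and the L2ARC devices on a miss.
zfs set primarycache=metadata tank

# Let the L2ARC devices cache both data and metadata.
zfs set secondarycache=all tank

# Cap the ARC at 10 GB via /etc/system (takes effect after reboot):
#   set zfs:zfs_arc_max = 0x280000000
```

Note that `zfs_arc_max` is a hard ceiling, so the ARC will not grow into the remaining RAM even when it is otherwise idle.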
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss