One thing I've been confused about for a long time is the relationship
between ZFS, the ARC, and the page cache. 

We have an application that's a quasi-database. It reads files by
mmap()ing them (writes are done via write()). We're talking 100TB of
data in files that range from 100k to 50G in size. The files have
headers that tell the app which segment to map, so the mapped chunks
are in the 100k-50M range, though access is sometimes sequential.
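
For context, the read path looks roughly like this - an illustrative
sketch only, with made-up struct and function names, assuming the
segment offset in the header is page-aligned:

    #include <sys/types.h>
    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>

    typedef struct {
        off_t  seg_offset;   /* where the segment starts in the file */
        size_t seg_length;   /* typically 100k-50M */
    } seg_hdr_t;

    void *
    map_segment(const char *path, seg_hdr_t *hdr)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return (MAP_FAILED);

        /* the file's own header tells us which slice to map */
        (void) pread(fd, hdr, sizeof (*hdr), 0);

        /*
         * PROT_READ/MAP_SHARED is assumed here; whether our code really
         * uses MAP_SHARED or MAP_PRIVATE is part of the swap question
         * below.
         */
        void *p = mmap(NULL, hdr->seg_length, PROT_READ, MAP_SHARED,
            fd, hdr->seg_offset);
        (void) close(fd);    /* the mapping survives the close */
        return (p);
    }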

I found it confusing that we ended up having to allocate a ton of swap
to back anon pages behind all the mmap()ing. We never write to an
mmap()ed region, so nothing ever actually lands in swap; it's not a
huge deal, but it's curious.
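
One possibility I haven't ruled out is the mapping flags: my
understanding (an assumption on my part, not something I've verified in
our code) is that a writable MAP_PRIVATE mapping makes Solaris reserve
swap for the whole length up front, to back potential copy-on-write,
even if nothing is ever written. A rough sketch of the difference:

    #include <sys/types.h>
    #include <sys/mman.h>

    void
    map_variants(int fd, size_t len, off_t off)
    {
        /* writable MAP_PRIVATE: swap reserved for 'len' bytes at mmap() time */
        void *cow = mmap(NULL, len, PROT_READ | PROT_WRITE,
            MAP_PRIVATE, fd, off);

        /* read-only MAP_SHARED: purely file-backed, no anon/swap reservation */
        void *ro = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, off);

        /*
         * MAP_NORESERVE keeps the private mapping but skips the up-front
         * reservation (risking SIGBUS/SIGSEGV later if swap runs out when
         * a page is finally dirtied).
         */
        void *nores = mmap(NULL, len, PROT_READ | PROT_WRITE,
            MAP_PRIVATE | MAP_NORESERVE, fd, off);

        (void) cow; (void) ro; (void) nores;
    }

If we're already mapping read-only MAP_SHARED, then ignore this and the
swap question stands.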

In the old days of UFS, there was the page cache. You mmap()ed a file,
it was allocated a range in your VM space, and the pager paged it in on
demand via VFS/UFS. UFS had a block cache, but it was only about big
enough to deal with queueing; the page cache was the main cache.

ZFS seems to break this model: the paging system and page cache are
still there, but now there's an ARC layer (and an L2ARC layer)
underneath them. If I read the concept right, it works best in a world
where applications read() and write(): all file caching is done within
the ARC, the page cache exists primarily for process heap, and the ARC
expands and contracts as necessary to stay out of the way of process
heap requirements, but otherwise caching happens in ARC space.

If I'm following this, what we're doing with mmap() is essentially
duplicating: file pages exist in the page cache, but they also exist in
the ARC, and since the ARC lives in kernel space, I presume the VM
subsystem doesn't know that a page sitting in the page cache is also
being held in the ARC.
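
For what it's worth, the way I'd check whether the duplication is real
is to watch the ARC size next to freemem - something along the lines of
this libkstat sketch (link with -lkstat; error handling omitted):

    #include <kstat.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    main(void)
    {
        kstat_ctl_t *kc = kstat_open();
        kstat_t *arc = kstat_lookup(kc, "zfs", 0, "arcstats");
        kstat_t *sys = kstat_lookup(kc, "unix", 0, "system_pages");
        long pagesize = sysconf(_SC_PAGESIZE);

        for (;;) {
            (void) kstat_read(kc, arc, NULL);
            (void) kstat_read(kc, sys, NULL);

            /* arcstats:size is in bytes, freemem is in pages */
            kstat_named_t *size = kstat_data_lookup(arc, "size");
            kstat_named_t *freemem = kstat_data_lookup(sys, "freemem");

            (void) printf("arc %llu MB   free %llu MB\n",
                (unsigned long long)size->value.ui64 >> 20,
                ((unsigned long long)freemem->value.ul * pagesize) >> 20);
            (void) sleep(5);
        }
        /* NOTREACHED */
    }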

Following this line of thinking, I've tuned the box so the ARC is
relatively small (10G out of 96G) and caches only metadata, with piles
of L2ARC behind it. The assumption is that the page cache copy is the
one I actually need, so I let the pager decide what stays in RAM and
lean on the I/O subsystem to make up the difference.
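
For reference, that policy boils down to roughly the following knobs
(tank/data is a placeholder for the real dataset; the /etc/system line
caps the ARC at 10G and only takes effect at boot):

    * /etc/system
    set zfs:zfs_arc_max = 0x280000000

    # per dataset: keep only metadata in the ARC; secondarycache controls
    # what the L2ARC is allowed to hold ('all' is the default)
    zfs set primarycache=metadata tank/data
    zfs set secondarycache=all tank/data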

(This is less terrible than it sounds - the machine has 90 dual-port
SAS-2 spindles across 6 LSI controllers with 12 x4 uplinks off the
expanders, no daisy-chaining, with OCZ Vertex2Pro SSDs as L2ARC. I can
push 5 GByte/sec on and off disk all day without sweating hard.)

Is my line of thinking valid, or am I missing something? 

Thanks,
-bacon