On Thu, Oct 16, 2008 at 6:52 AM, Tomas Ögren <[EMAIL PROTECTED]> wrote:
> On 16 October, 2008 - Ross sent me these 1,1K bytes:
>
>> I might be misunderstanding here, but I don't see how you're going to
>> improve on "zfs set primarycache=metadata".
>>
>> You complain that ZFS throws away 96kb of data if you're only reading
>> 32kb at a time, but then also complain that you are IO/s bound and
>> that this is restricting your maximum transfer rate. If it's IO/s
>> that is limiting you, it makes no difference that ZFS is throwing away
>> 96kb of data; you're going to get the same IOPS and the same throughput
>> at your application whether you're using 32k or 128k ZFS record sizes.
>
> But with 1Gb FC, if I'm reading 100MB/s, it matters whether 100MB/s or
> 25MB/s of that is actually used for something..
>
>> Also, you're asking on one hand for each disk to get larger IO blocks,
>> and on the other you are complaining that with large block sizes a lot
>> of data is wasted.
>
> .. if I turn off data caching (and only leave metadata caching on).
>
>> That looks like a contradictory argument to me, as you can't have both
>> of the things you're asking for. You just need to pick whichever one is
>> more suited to your needs.
>>
>> Like I said, I may be misunderstanding, but I think you might be
>> looking for something that you don't actually need.
>
> Ok. ZFS prefetch can help, but I don't want it to use up all my RAM for
> data cache.. Using it for small temporary buffers while reading stuff
> from disk is good, but once data has been read from disk and used once
> (delivered over NFS), there is a very low probability that I will need
> it again before it has been flushed (because 4TB > 8GB).
>
> With default tuning, ZFS will keep stacking up these "use once" data
> blocks in the cache, pushing out metadata cache which actually has a
> good chance of being used again (metadata for all our files can fit in
> the 8GB of RAM, but 4TB of data can't).
>
> So if I could tell ZFS, "Here you have 512M (or whatever) of ARC space
> that you can use for prefetch etc. Leave the other 7.5GB of RAM for
> metadata cache."
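For reference, the knobs that exist today look roughly like this (a sketch
only - the dataset name and the cap value below are placeholders, and
neither knob gives you the "512M for prefetch, rest for metadata" split
you describe):

    # keep only metadata in the ARC for a given dataset
    zfs set primarycache=metadata tank/export

    # or cap the size of the whole ARC - add to /etc/system and reboot
    # (0x20000000 bytes = 512MB; pick a value that suits your workload)
    set zfs:zfs_arc_max = 0x20000000
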
I see where you are going with this - but it looks like your performance
limit is more likely IOPS (I/O operations per second) and latency, rather
than disk subsystem bandwidth. You want ZFS to know which blocks to read to
satisfy a request for the next "chunk" of a file (to be downloaded), so that
each I/O operation reads data that you need, every time. But ZFS is filling
up your cache with data blocks which you are unlikely to re-use. If you set
the ZFS caches to metadata only, you'll probably find that you're still not
getting enough IOPS.

The way you maximize IOPS is to use multi-way mirrors for your data. For
example, if you have a 5-way mirror composed of five 15k RPM drives, you'll
see roughly 5 * 700 IOPS - and my *guess* is that you'll need about 2,500
IOPS to "busy out" a reasonably powerful (in terms of CPU) NFS server
"feeding" a gigabit ethernet port. So what if some of those I/O ops are
being used to traverse metadata to get to the data blocks? If you can do
3.5k IOPS and you only need 2.5k, you can "afford" to "waste" some of them
because you don't have all the metadata cached (the arithmetic is sketched
in the P.S. below).

I strongly suspect that if you talked one of the ZFS developers into cooking
up an experimental version of ZFS that does what you ask above, you would
still not get the IOPS and system response (low latency) you need to get the
real work done.

The "correct" solution, aside from a multi-way mirror disk config (and I
know you don't want to hear this), is to equip your NFS server with 32GB
(or more) of RAM. You simply need a server-style motherboard with 16 DIMM
slots and inexpensive (Kingston) RAM at approximately $23/gigabyte. Any
server-grade motherboard with one fast multi-core CPU will get the job done
here. ZFS is designed to scale beautifully with more RAM; your most viable
solution is to use it the way the designers intended.

Another point comes to mind while thinking about the disk drives. We have
three basic categories of drives with the following broad operational
characteristics:

a) inexpensive, large-capacity SATA drives running at 7,200 RPM and
   providing approximately 300 IOPS
b) expensive, small-capacity SAS drives running at 15k RPM and providing
   approximately 700 IOPS
c) SSDs - currently not available at a cost per gigabyte that makes them
   viable for your application, but capable of 3.5k+ IOPS

And you need large (inexpensive) capacity but high IOPS - which you can
only get from multi-way mirror configs.

Solutions:

1) a multi-way mirror config of RAIDZ SATA-disk-based pools (lots of
   drives, lots of power)
2) a multi-way mirror config of WD VelociRaptor 10k RPM drives (the
   version that fits in a 2.5" bay is part # WD3000BLFS)

I would strongly consider option 2) above; take a look at the capacity and
IOPS available from this drive.

Regards,

--
Al Hopper  Logical Approach Inc, Plano, TX  [EMAIL PROTECTED]
           Voice: 972.379.2133  Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
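P.S. A back-of-the-envelope sketch of the IOPS arithmetic above (the 700
and 2,500 figures are rough assumptions, not measurements):

    # assumed random-read IOPS per 15k RPM SAS drive
    PER_DRIVE=700
    # drives in the proposed multi-way mirror
    DRIVES=5
    # rough guess at what a gigabit NFS server needs
    NEEDED=2500

    AVAILABLE=$(( DRIVES * PER_DRIVE ))   # ~3500 IOPS from the mirror
    echo "available: ${AVAILABLE} IOPS"
    echo "headroom for metadata traversal: $(( AVAILABLE - NEEDED )) IOPS"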