Howdy! Very valuable advice here (and from Bob, who made similar comments - thanks, Bob!). I think, then, we'll generally stick to 128K recordsizes. In the case of databases, we'll stray as appropriate, and we may also stray with the HPC compute cluster if we can demonstrate that it is worth it.
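(For the database filesystems, that just means a per-dataset override rather than any pool-wide change. Something like the following, with made-up dataset names and an 8K DB block size used purely as an example:

# set at creation time, matching the database block size
zfs create -o recordsize=8k data/oradb
# or on an existing filesystem; only affects newly written files
zfs set recordsize=8k data/oradb
# verify
zfs get recordsize data/oradb
)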
To answer your questions below... Currently, we have a single pool, in a "load share" configuration (no raidz), that collects all the storage (which answers Ross' question too). From that we carve filesystems on demand. There are many more tests planned for that construction, though, so we are not married to it.

Redundancy abounds. ;> Since the pool doesn't employ raidz, it isn't internally redundant, but we plan to replicate the pool's data to an identical system (which is not yet built) at another site. Our initial userbase doesn't need the replication, however, because they use the system for little more than scratch space: huge genomic datasets are dumped on the storage, analyzed, and the results (which are much smaller) get sent elsewhere. Everything is wiped out soon after that and the process starts again. Future projected uses of the storage, however, would be far less tolerant of loss, so I expect we'll want to reconfigure the pool with raidz. (A rough sketch of one way we might handle the replication is appended below, after the test results.)

I see that Archie and Miles have shared some harrowing concerns, which we take very seriously. I don't think I'll be able to reply to them today, but I certainly will in the near future (particularly once we've completed some more of our induced-failure scenarios).

Sidenote: Today we made eight network/iSCSI-related tweaks that, in aggregate, have resulted in dramatic performance improvements (some I just hadn't gotten around to yet, others suggested by Sun's Mertol Ozyoney)...

- disabling the Nagle algorithm on the head node
- setting each iSCSI target's block size to match the ZFS recordsize of 128K
- disabling "thin provisioning" on the iSCSI targets
- enabling jumbo frames everywhere (each switch and NIC)
- raising ddi_msix_alloc_limit to 8
- raising ip_soft_rings_cnt to 16
- raising tcp_deferred_acks_max to 16
- raising tcp_local_dacks_max to 16

(The rough commands for these are also appended after the test results, in case they save anyone some digging.)

Rerunning the same tests, we now see...

[1GB file size, 1KB record size]
Command: iozone -i 0 -i 1 -i 2 -r 1k -s 1g -f /data-das/perftest/1gbtest
Write: 143373
Rewrite: 183170
Read: 433205
Reread: 435503
Random Read: 90118
Random Write: 19488

[8GB file size, 512KB record size]
Command: iozone -i 0 -i 1 -i 2 -r 512k -s 8g -f /volumes/data-iscsi/perftest/8gbtest
Write: 463260
Rewrite: 449280
Read: 1092291
Reread: 881044
Random Read: 442565
Random Write: 565565

[64GB file size, 1MB record size]
Command: iozone -i 0 -i 1 -i 2 -r 1m -s 64g -f /data-das/perftest/64gbtest
Write: 357199
Rewrite: 342788
Read: 609553
Reread: 645618
Random Read: 218874
Random Write: 339624
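In case it's useful, the Solaris-side knobs from the tweak list map to roughly the following. This is a sketch rather than a transcript of exactly what we ran: the interface name is a placeholder, the ndd changes do not persist across reboots without an init script, and the iSCSI block size and thin provisioning changes go through whatever admin tool your particular targets use, so I've left those out.

# Disable the Nagle algorithm on the head node (not persistent across reboots)
ndd -set /dev/tcp tcp_naglim_def 1

# Raise the deferred-ACK limits
ndd -set /dev/tcp tcp_deferred_acks_max 16
ndd -set /dev/tcp tcp_local_dacks_max 16

# Jumbo frames; interface name is only an example, and the switches need a matching MTU
ifconfig e1000g0 mtu 9000

# In /etc/system (takes effect after a reboot):
set ddi_msix_alloc_limit=8
set ip:ip_soft_rings_cnt=16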
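As for the replication to the second site once that system exists: one straightforward approach would be snapshot-based zfs send/receive along these lines (pool, dataset, and host names are made up for illustration):

# Initial full copy of a dataset to the remote system
zfs snapshot data/genomics@rep1
zfs send data/genomics@rep1 | ssh mirror-host zfs receive data/genomics

# Later, incremental updates between snapshots
zfs snapshot data/genomics@rep2
zfs send -i rep1 data/genomics@rep2 | ssh mirror-host zfs receive -F data/genomics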
Thanks so much to everyone for all their great contributions!
-Gray

On Thu, Oct 16, 2008 at 2:20 AM, Akhilesh Mritunjai <[EMAIL PROTECTED]> wrote:

> Hi Gray,
>
> You've got a nice setup going there, few comments:
>
> 1. Do not tune ZFS without a proven test-case to show otherwise, except...
> 2. For databases. Tune recordsize for that particular FS to match DB
>    recordsize.
>
> Few questions...
>
> * How are you divvying up the space ?
> * How are you taking care of redundancy ?
> * Are you aware that each layer of ZFS needs its own redundancy ?
>
> Since you have got a mixed use case here, I would be surprised if a general
> config would cover all, though it might do with some luck.

--
Gray Carper
MSIS Technical Services
University of Michigan Medical School
[EMAIL PROTECTED] | skype: graycarper | 734.418.8506
http://www.umms.med.umich.edu/msis/