Oops - one thing I meant to mention: we only plan to replicate data across sites for those folks who require it. The HPC data crunching would have no use for it, so that filesystem wouldn't be replicated. In reality, we only expect a select few users, with relatively small filesystems, to actually need replication. (Which raises the question: why build an identical 150TB system to support that? Good question. I think we'll reevaluate. ;>)
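For the few filesystems that do need it, I'm picturing something snapshot-based along the lines of zfs send/receive over ssh. This is just a sketch, and the dataset, pool, and host names below are placeholders rather than anything we've actually built:

  # Seed the remote copy once (all names are placeholders)
  zfs snapshot data/genomics@rep1
  zfs send data/genomics@rep1 | ssh backup-host zfs receive tank/genomics

  # Then periodic incremental catch-ups
  zfs snapshot data/genomics@rep2
  zfs send -i @rep1 data/genomics@rep2 | ssh backup-host zfs receive -F tank/genomics

That keeps replication per-filesystem, so the HPC scratch space never has to leave the building.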
-Gray

On Thu, Oct 16, 2008 at 3:50 PM, Gray Carper <[EMAIL PROTECTED]> wrote:

> Howdy!
>
> Very valuable advice here (and from Bob, who made similar comments -
> thanks, Bob!). I think, then, we'll generally stick to 128K recordsizes.
> In the case of databases, we'll stray as appropriate, and we may also
> stray with the HPC compute cluster if we can demonstrate that it is
> worth it.
>
> To answer your questions below...
>
> Currently, we have a single pool, in a "load share" configuration (no
> raidz), that collects all the storage (which answers Ross' question
> too). From that we carve filesystems on demand. There are many more
> tests planned for that construction, though, so we are not married to it.
>
> Redundancy abounds. ;> Since the pool doesn't employ raidz, it isn't
> internally redundant, but we plan to replicate the pool's data to an
> identical system (which is not yet built) at another site. Our initial
> userbase doesn't need the replication, however, because they use the
> system for little more than scratch space. Huge genomic datasets are
> dumped on the storage, analyzed, and the results (which are much
> smaller) get sent elsewhere. Everything is wiped out soon after that
> and the process starts again. Future projected uses of the storage,
> however, would be far less tolerant of loss, so I expect we'll want to
> reconfigure the pool with raidz.
>
> I see that Archie and Miles have shared some harrowing concerns which
> we take very seriously. I don't think I'll be able to reply to them
> today, but I certainly will in the near future (particularly once we've
> completed some more of our induced failure scenarios).
>
> Sidenote: Today we made eight network/iSCSI-related tweaks that, in
> aggregate, have resulted in dramatic performance improvements (some I
> just hadn't gotten around to yet, others suggested by Sun's Mertol
> Ozyoney)...
>
> - disabling the Nagle algorithm on the head node
> - setting each iSCSI target block size to match the ZFS record size of 128K
> - disabling "thin provisioning" on the iSCSI targets
> - enabling jumbo frames everywhere (each switch and NIC)
> - raising ddi_msix_alloc_limit to 8
> - raising ip_soft_rings_cnt to 16
> - raising tcp_deferred_acks_max to 16
> - raising tcp_local_dacks_max to 16
>
> Rerunning the same tests, we now see (figures are iozone throughput in
> KB/sec)...
>
> [1GB file size, 1KB record size]
> Command: iozone -i 0 -i 1 -i 2 -r 1k -s 1g -f /data-das/perftest/1gbtest
> Write: 143373
> Rewrite: 183170
> Read: 433205
> Reread: 435503
> Random Read: 90118
> Random Write: 19488
>
> [8GB file size, 512KB record size]
> Command: iozone -i 0 -i 1 -i 2 -r 512k -s 8g -f /volumes/data-iscsi/perftest/8gbtest
> Write: 463260
> Rewrite: 449280
> Read: 1092291
> Reread: 881044
> Random Read: 442565
> Random Write: 565565
>
> [64GB file size, 1MB record size]
> Command: iozone -i 0 -i 1 -i 2 -r 1m -s 64g -f /data-das/perftest/64gbtest
> Write: 357199
> Rewrite: 342788
> Read: 609553
> Reread: 645618
> Random Read: 218874
> Random Write: 339624
>
> Thanks so much to everyone for all their great contributions!
> -Gray
>
>
> On Thu, Oct 16, 2008 at 2:20 AM, Akhilesh Mritunjai
> <[EMAIL PROTECTED]> wrote:
>
>> Hi Gray,
>>
>> You've got a nice setup going there. A few comments:
>>
>> 1. Do not tune ZFS without a proven test case to show otherwise, except...
>> 2. For databases. Tune recordsize for that particular FS to match the
>> DB recordsize.
>>
>> A few questions...
>>
>> * How are you divvying up the space?
>> * How are you taking care of redundancy?
>> * Are you aware that each layer of ZFS needs its own redundancy?
>>
>> Since you have got a mixed use case here, I would be surprised if a
>> general config would cover it all, though it might with some luck.
>> --
>> This message posted from opensolaris.org
>
> --
> Gray Carper
> MSIS Technical Services
> University of Michigan Medical School
> [EMAIL PROTECTED] | skype: graycarper | 734.418.8506
> http://www.umms.med.umich.edu/msis/

--
Gray Carper
MSIS Technical Services
University of Michigan Medical School
[EMAIL PROTECTED] | skype: graycarper | 734.418.8506
http://www.umms.med.umich.edu/msis/
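For reference, the eight tweaks listed in the message above map roughly onto settings like the following on a Solaris/OpenSolaris head node. This is a hedged sketch rather than the exact commands from the thread: the NIC name, the example database recordsize, and the split between ndd (non-persistent) and /etc/system (needs a reboot) are assumptions, and the iSCSI target block size and thin-provisioning changes depend on which target software is in use.

  # Disable Nagle and raise the delayed-ACK tunables (ndd settings do not survive a reboot)
  ndd -set /dev/tcp tcp_naglim_def 1
  ndd -set /dev/tcp tcp_deferred_acks_max 16
  ndd -set /dev/tcp tcp_local_dacks_max 16

  # Jumbo frames on the head node NIC (interface name is a placeholder; switch ports must match)
  ifconfig nxge0 mtu 9000

  # Persistent kernel tunables via /etc/system (take effect after a reboot)
  echo 'set ddi_msix_alloc_limit=8'  >> /etc/system
  echo 'set ip:ip_soft_rings_cnt=16' >> /etc/system

  # Per-dataset recordsize for a database filesystem, per Akhilesh's advice (16K is only an example)
  zfs set recordsize=16K data/dbfs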
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss