Oops - one thing I meant to mention: We only plan to cross-site replicate
data for those folks who require it. The HPC data crunching would have no
use for it, so that filesystem wouldn't be replicated. In reality, we only
expect a select few users, with relatively small filesystems, to actually
need replication. (Which raises the question: why build an identical 150TB
system to support that? Good question. I think we'll reevaluate. ;>)
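
For illustration, per-filesystem replication with zfs send/receive would look
roughly like the sketch below (the filesystem and remote host names are
placeholders, not our actual configuration):

  # snapshot only the filesystem that needs cross-site protection
  zfs snapshot data/clinical@rep1
  # initial full copy to the remote head node
  zfs send data/clinical@rep1 | ssh remote-head zfs receive -F data/clinical
  # later passes send only the changes since the previous snapshot
  zfs snapshot data/clinical@rep2
  zfs send -i rep1 data/clinical@rep2 | ssh remote-head zfs receive data/clinical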

-Gray

On Thu, Oct 16, 2008 at 3:50 PM, Gray Carper <[EMAIL PROTECTED]> wrote:

> Howdy!
>
> Very valuable advice here (and from Bob, who made similar comments -
> thanks, Bob!). I think, then, we'll generally stick to 128K recordsizes. In
> the case of databases, we'll stray as appropriate, and we may also stray
> with the HPC compute cluster if we can demonstrate that it is worth it.
>
> To answer your questions below...
>
> Currently, we have a single pool, in a "load share" configuration (no
> raidz), that collects all the storage (which answers Ross' question too).
> From that we carve filesystems on demand. There are many more tests planned
> for that construction, though, so we are not married to it.
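>
> As a rough sketch of what that carving looks like in practice (the pool and
> filesystem names, and the property values, are placeholders rather than our
> real config):
>
>   zfs create data/genomics-scratch
>   zfs set quota=10t data/genomics-scratch
>   zfs set recordsize=128k data/genomics-scratch
>   zfs set sharenfs=on data/genomics-scratch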
>
> Redundancy abounds. ;> Since the pool doesn't employ raidz, it isn't
> internally redundant, but we plan to replicate the pool's data to an
> identical system (which is not yet built) at another site. Our initial
> userbase doesn't need the replication, however, because they use the system
> for little more than scratch space. Huge genomic datasets are dumped on the
> storage, analyzed, and the results (which are much smaller) get sent
> elsewhere. Everything is wiped out soon after that and the process starts
> again. Future projected uses of the storage, however, would be far less
> tolerant of loss, so I expect we'll want to reconfigure the pool with raidz.
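>
> (Worth noting: ZFS can't convert an existing non-redundant pool to raidz in
> place, so that reconfiguration would mean migrating the data off, destroying
> the pool, and recreating it roughly along these lines, with placeholder
> device names:
>
>   zpool destroy data
>   zpool create data raidz2 c3t0d0 c3t1d0 c3t2d0 c3t3d0 c3t4d0 c3t5d0
>
> and then restoring the data.)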
>
> I see that Archie and Miles have shared some harrowing concerns which we
> take very seriously. I don't think I'll be able to reply to them today, but
> I certainly will in the near future (particularly once we've completed some
> more of our induced failure scenarios).
>
> Sidenote: Today we made eight network/iSCSI related tweaks that, in
> aggregate, have resulted in dramatic performance improvements (some I just
> hadn't gotten around to yet, others suggested by Sun's Mertol Ozyoney)...
>
> - disabling the Nagle algorithm on the head node
> - setting each iSCSI target block size to match the ZFS record size of 128K
> - disabling "thin provisioning" on the iSCSI targets
> - enabling jumbo frames everywhere (each switch and NIC)
> - raising ddi_msix_alloc_limit to 8
> - raising ip_soft_rings_cnt to 16
> - raising tcp_deferred_acks_max to 16
> - raising tcp_local_dacks_max to 16
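>
> For anyone who wants to reproduce these, the usual way to apply settings
> like that on an (Open)Solaris head node is roughly the following (the NIC
> name is a placeholder; the ndd settings don't survive a reboot without an
> init script, the /etc/system lines need a reboot, and the iSCSI block size
> and thin-provisioning changes are made in the target software itself):
>
>   # TCP tunables; tcp_naglim_def=1 effectively disables Nagle
>   ndd -set /dev/tcp tcp_naglim_def 1
>   ndd -set /dev/tcp tcp_deferred_acks_max 16
>   ndd -set /dev/tcp tcp_local_dacks_max 16
>
>   # jumbo frames on the storage NIC (the switches, and possibly the
>   # driver's .conf file, need matching changes)
>   ifconfig e1000g1 mtu 9000
>
>   # /etc/system additions (take effect at next boot)
>   set ddi_msix_alloc_limit=8
>   set ip:ip_soft_rings_cnt=16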
>
> Rerunning the same tests, we now see (throughput in KB/sec, iozone's default)...
>
> [1GB file size, 1KB record size]
> Command: iozone -i 0 -i 1 -i 2 -r 1k -s 1g -f /data-das/perftest/1gbtest
> Write: 143373
> Rewrite: 183170
> Read: 433205
> Reread: 435503
> Random Read: 90118
> Random Write: 19488
>
> [8GB file size, 512KB record size]
> Command: iozone -i 0 -i 1 -i 2 -r 512k -s 8g -f
> /volumes/data-iscsi/perftest/8gbtest
> Write: 463260
> Rewrite: 449280
> Read: 1092291
> Reread: 881044
> Random Read: 442565
> Random Write: 565565
>
> [64GB file size, 1MB record size]
> Command: iozone -i 0 -i 1 -i 2 -r 1m -s 64g -f /data-das/perftest/64gbtest
> Write: 357199
> Rewrite: 342788
> Read: 609553
> Reread: 645618
> Random Read: 218874
> Random Write: 339624
>
> Thanks so much to everyone for all their great contributions!
> -Gray
>
>
> On Thu, Oct 16, 2008 at 2:20 AM, Akhilesh Mritunjai <[EMAIL PROTECTED]> wrote:
>
>> Hi Gray,
>>
>> You've got a nice setup going there; a few comments:
>>
>> 1. Do not tune ZFS without a proven test case to show otherwise, except...
>> 2. ...for databases. Tune the recordsize for that particular FS to match the
>> DB record size.
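>>
>> For example, for a database with an 8K block size (the filesystem name below
>> is hypothetical, and recordsize only affects files written after the change):
>>
>>   zfs set recordsize=8k data/dbfs
>>   zfs get recordsize data/dbfs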
>>
>> A few questions...
>>
>> * How are you divvying up the space?
>> * How are you taking care of redundancy?
>> * Are you aware that each layer of ZFS needs its own redundancy?
>>
>> Since you have got a mixed use case here, I would be surprised if a
>> general config would cover all of them, though it might with some luck.
>>
>
>
>
> --
> Gray Carper
> MSIS Technical Services
> University of Michigan Medical School
> [EMAIL PROTECTED]  |  skype:  graycarper  |  734.418.8506
> http://www.umms.med.umich.edu/msis/
>



-- 
Gray Carper
MSIS Technical Services
University of Michigan Medical School
[EMAIL PROTECTED]  |  skype:  graycarper  |  734.418.8506
http://www.umms.med.umich.edu/msis/
