Howdy, Brent! Thanks for your interest! We're pretty enthused about this project over here, and I'd be happy to share some details with you (and anyone else who cares to peek). In this post I'll try to hit the major configuration bullet points, but I can also throw you command-line-level specifics if you want them.
1. The six Thumper iSCSI target nodes, and the iSCSI initiator head node, all have a high-availability network configuration that marries link aggregation and IP multipathing. Each machine has four 1GbE interfaces and one 10GbE interface (we could have had two 10GbE interfaces, but we decided to save some cash ;>). We link-aggregate the four 1GbE interfaces together to create a fatter 4Gb pipe, then we use IP multipathing to group the 10GbE interface and the 4Gb aggregation together. Through this we create a virtual "service IP" which can float back and forth, automatically, between the two interfaces in the event of a network path failure. The preferred home is the 10GbE interface, but if that dies (or any part of its network path dies, like a switch somewhere down the line), then the service IP migrates to the 4Gb aggregate (which is on a completely separate network path) within four seconds. Whenever the 10GbE interface is happy again, the service IP automatically migrates back to its home.

2. The head node also has an InfiniBand interface which plugs it into our HPC compute cluster network, giving the cluster direct access to whatever storage it needs.

3. All six iSCSI nodes have a redundant disk configuration using four ZFS raidz2 groups, each containing 10 drives which are spread across five controllers. Six additional disks, from a sixth controller, also live in the pool as spares. This results in a 28.4TB data pool for each node that can survive disk and controller failures.

4. Each of the six iSCSI nodes presents the entirety of its 28TB pool through a CHAP-authenticated iSCSI target. (See http://docs.sun.com/app/docs/doc/819-5461/gechv?a=view for more info on that process.)

5. The NAS head node has wrangled up all six of the iSCSI targets (using "iscsiadm add discovery-address ...") and joined them to create ~150TB of usable storage (using "zpool create" against the devices created with iscsiadm). With that, we've been able to carve up the storage into multiple ZFS filesystems, each with its own recordsize, quota, permissions, NFS/CIFS shares, etc.

I think that about covers the high-level stuff. If there's any area you want to dive deeper into, fire away! (I've also tacked some rough command sketches onto the end of this message, below my signature, for the command-line-curious.)

-Gray

On Wed, Oct 15, 2008 at 1:29 AM, Brent Jones <[EMAIL PROTECTED]> wrote:
> On Tue, Oct 14, 2008 at 12:31 AM, Gray Carper <[EMAIL PROTECTED]> wrote:
> > Hey, all!
> >
> > We've recently used six x4500 Thumpers, all publishing ~28TB iSCSI
> targets over ip-multipathed 10GB ethernet, to build a ~150TB ZFS pool on an
> x4200 head node. In trying to discover optimal ZFS pool construction
> settings, we've run a number of iozone tests, so I thought I'd share them
> with you and see if you have any comments, suggestions, etc.
> >
> > First, on a single Thumper, we ran baseline tests on the direct-attached
> storage (which is collected into a single ZFS pool comprised of four raidz2
> groups)...
> >
> > [1GB file size, 1KB record size]
> > Command: iozone -i o -i 1 -i 2 -r 1k -s 1g -f /data-das/perftest/1gbtest
> > Write: 123919
> > Rewrite: 146277
> > Read: 383226
> > Reread: 383567
> > Random Read: 84369
> > Random Write: 121617
> >
> > [8GB file size, 512KB record size]
> > Command:
> > Write: 373345
> > Rewrite: 665847
> > Read: 2261103
> > Reread: 2175696
> > Random Read: 2239877
> > Random Write: 666769
> >
> > [64GB file size, 1MB record size]
> > Command: iozone -i o -i 1 -i 2 -r 1m -s 64g -f
> /data-das/perftest/64gbtest
> > Write: 517092
> > Rewrite: 541768
> > Read: 682713
> > Reread: 697875
> > Random Read: 89362
> > Random Write: 488944
> >
> > These results look very nice, though you'll notice that the random read
> numbers tend to be pretty low on the 1GB and 64GB tests (relative to their
> sequential counterparts), but the 8GB random (and sequential) read is
> unbelievably good.
> >
> > Now we move to the head node's iSCSI aggregate ZFS pool...
> >
> > [1GB file size, 1KB record size]
> > Command: iozone -i o -i 1 -i 2 -r 1k -s 1g -f
> /volumes/data-iscsi/perftest/1gbtest
> > Write: 127108
> > Rewrite: 120704
> > Read: 394073
> > Reread: 396607
> > Random Read: 63820
> > Random Write: 5907
> >
> > [8GB file size, 512KB record size]
> > Command: iozone -i 0 -i 1 -i 2 -r 512 -s 8g -f
> /volumes/data-iscsi/perftest/8gbtest
> > Write: 235348
> > Rewrite: 179740
> > Read: 577315
> > Reread: 662253
> > Random Read: 249853
> > Random Write: 274589
> >
> > [64GB file size, 1MB record size]
> > Command: iozone -i o -i 1 -i 2 -r 1m -s 64g -f
> /volumes/data-iscsi/perftest/64gbtest
> > Write: 190535
> > Rewrite: 194738
> > Read: 297605
> > Reread: 314829
> > Random Read: 93102
> > Random Write: 175688
> >
> > Generally speaking, the results look good, but you'll notice that random
> writes are atrocious on the 1GB tests and random reads are not so great on
> the 1GB and 64GB tests, but the 8GB test looks great across the board.
> Voodoo! ;> Incidentally, I ran all these tests against the ZFS pool in disk,
> raidz1, and raidz2 modes - there were no significant changes in the results.
> >
> > So, how concerned should we be about the low scores here and there? Any
> suggestions on how to improve our configuration? And how excited should we
> be about the 8GB tests? ;>
> >
> > Thanks so much for any input you have!
> > -Gray
> > ---
> > University of Michigan
> > Medical School Information Services
> > --
> > This message posted from opensolaris.org
> > _______________________________________________
> > zfs-discuss mailing list
> > zfs-discuss@opensolaris.org
> > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
> >
>
> Your setup sounds very interesting how you export iSCSI to another
> head unit, can you give me some more details on your file system
> layout, and how you mount it on the head unit?
> Sounds like a pretty clever way to export awesomely large volumes!
>
> Regards,
>
> --
> Brent Jones
> [EMAIL PROTECTED]
>
--
Gray Carper
MSIS Technical Services
University of Michigan Medical School
[EMAIL PROTECTED] | skype: graycarper | 734.418.8506
http://www.umms.med.umich.edu/msis/
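P.S. As promised above, here are some rough command sketches to go with the numbered points. These are sketches, not transcripts: the interface names, IP addresses, device names, and pool names below are illustrative stand-ins (the pool names are just borrowed from the iozone paths), so treat them as starting points rather than copy-paste recipes.

For point 1, the shape of it is: dladm builds the 4-port aggregate, then the aggregate and the 10GbE interface are placed in one IPMP group so the service IP can float between them. Something along these lines, with made-up interface names and addresses:

  # Bundle the four 1GbE ports into a single aggregated link (key 1 -> aggr1)
  dladm create-aggr -d e1000g0 -d e1000g1 -d e1000g2 -d e1000g3 1

  # Put the 10GbE interface and the aggregate into the same IPMP group;
  # the service IP lives on the 10GbE side, the aggregate acts as standby
  ifconfig ixgbe0 plumb 10.0.0.10 netmask 255.255.255.0 group storage0 up
  ifconfig aggr1 plumb group storage0 standby up

  # Failure detection/failback is handled by in.mpathd; FAILURE_DETECTION_TIME
  # (milliseconds) in /etc/default/mpathd is the knob behind the ~4 second migration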
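For point 3, each Thumper's local pool is a single zpool create with four 10-disk raidz2 vdevs (drawing from five controllers per vdev) plus the spares from the sixth controller. With x4500-style device names (again, illustrative rather than our exact slot layout):

  zpool create data-das \
    raidz2 c0t0d0 c0t1d0 c1t0d0 c1t1d0 c2t0d0 c2t1d0 c3t0d0 c3t1d0 c4t0d0 c4t1d0 \
    raidz2 c0t2d0 c0t3d0 c1t2d0 c1t3d0 c2t2d0 c2t3d0 c3t2d0 c3t3d0 c4t2d0 c4t3d0 \
    raidz2 c0t4d0 c0t5d0 c1t4d0 c1t5d0 c2t4d0 c2t5d0 c3t4d0 c3t5d0 c4t4d0 c4t5d0 \
    raidz2 c0t6d0 c0t7d0 c1t6d0 c1t7d0 c2t6d0 c2t7d0 c3t6d0 c3t7d0 c4t6d0 c4t7d0 \
    spare c5t0d0 c5t1d0 c5t2d0 c5t3d0 c5t4d0 c5t5d0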
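For point 4, I'll hedge, since the Sun doc linked above covers it better than my memory does: the general shape is to carve a zvol out of the local pool, publish it as an iSCSI target, and then attach CHAP credentials to that target with iscsitadm. Roughly (dataset names and sizes are placeholders):

  # Create a zvol spanning the bulk of the pool and share it as an iSCSI target
  zfs create -V 28T data-das/tgt0
  zfs set shareiscsi=on data-das/tgt0

  # Bolt CHAP onto the target (iscsitadm prompts for the secret)
  iscsitadm modify target --chap-name thumper-chap data-das/tgt0
  iscsitadm modify target --chap-secret data-das/tgt0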
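And for point 5, on the head node: add each Thumper as a discovery address, enable sendtargets discovery, hand the initiator the matching CHAP secret, and the six big LUNs then show up as ordinary disks that zpool create can use. A sketch with placeholder addresses and device names (real iSCSI LUNs appear with much longer, target-derived c#t#d# names in format):

  # Point the initiator at the six Thumpers (one discovery address each)
  iscsiadm add discovery-address 10.0.0.11:3260
  iscsiadm modify discovery --sendtargets enable

  # CHAP on the initiator side, matching what the targets expect
  iscsiadm modify initiator-node --authentication CHAP
  iscsiadm modify initiator-node --CHAP-secret

  # The six ~28TB LUNs appear as ordinary devices; build the ~150TB pool across them
  zpool create -m /volumes/data-iscsi data-iscsi \
    c2t01d0 c2t02d0 c2t03d0 c2t04d0 c2t05d0 c2t06d0

  # Then carve out filesystems, each with its own personality
  zfs create data-iscsi/somefs
  zfs set recordsize=128k data-iscsi/somefs
  zfs set quota=10T data-iscsi/somefs
  zfs set sharenfs=on data-iscsi/somefs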
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss