Howdy, Brent!

Thanks for your interest! We're pretty enthused about this project over here
and I'd be happy to share some details with you (and anyone else who cares
to peek). In this post I'll try to hit the major configuration
bullet-points, but I can also throw you command-line level specifics if you
want them.

1. The six Thumper iSCSI target nodes, and the iSCSI initiator head node,
all get a high-availability network configuration by marrying link
aggregation with IP multipathing (IPMP). Each machine has four 1GbE
interfaces and one 10GbE interface (we could have had two 10GbE
interfaces, but we decided to save some cash ;>). We aggregate the four
1GbE interfaces into a fatter 4Gb pipe, then use IPMP to group the 10GbE
interface and the 4Gb aggregate together. That gives us a virtual
"service IP" which can float back and forth, automatically, between the
two interfaces in the event of a network path failure. The preferred home
is the 10GbE interface, but if that dies (or any part of its network path
dies, like a switch somewhere down the line), the service IP migrates to
the 4Gb aggregate (which is on a completely separate network path) within
four seconds. Whenever the 10GbE interface is happy again, the service IP
automatically migrates back to its home.
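
If it helps to make that concrete, the commands look roughly like this
(the interface names, addresses, and IPMP group name are placeholders I
made up for illustration, not our actual config):

  # Aggregate the four 1GbE ports into aggr1 (Solaris 10 dladm syntax)
  dladm create-aggr -d e1000g0 -d e1000g1 -d e1000g2 -d e1000g3 1

  # The 10GbE interface carries the floating service IP plus a
  # non-failover test address for probe-based failure detection
  ifconfig nxge0 plumb 192.168.10.10 netmask + broadcast + group storage up
  ifconfig nxge0 addif 192.168.10.11 netmask + broadcast + deprecated -failover up

  # The 4Gb aggregate joins the same IPMP group with just a test address
  ifconfig aggr1 plumb 192.168.10.12 netmask + broadcast + deprecated -failover group storage up

  # The failure detection window is governed by FAILURE_DETECTION_TIME
  # (in milliseconds) in /etc/default/mpathd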

2. The head node also has an Infiniband interface which plugs it into our HPC
compute cluster network, giving the cluster direct access to whatever
storage it needs.

3. All six iSCSI nodes have a redundant disk configuration using four ZFS
raidz2 groups, each containing 10 drives which are spread across five
controllers. Six additional disks, from a sixth controller, also live in the
pool as spares. This results in a 28.4TB data pool for each node that can
survive disk and controller failures.
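
For the curious, the pool creation on each Thumper looks something like
this (device names are illustrative only: each raidz2 group takes two
disks from each of five controllers, with the spares on the sixth):

  zpool create data-das \
      raidz2 c0t0d0 c0t1d0 c1t0d0 c1t1d0 c2t0d0 c2t1d0 c3t0d0 c3t1d0 c4t0d0 c4t1d0 \
      raidz2 c0t2d0 c0t3d0 c1t2d0 c1t3d0 c2t2d0 c2t3d0 c3t2d0 c3t3d0 c4t2d0 c4t3d0 \
      raidz2 c0t4d0 c0t5d0 c1t4d0 c1t5d0 c2t4d0 c2t5d0 c3t4d0 c3t5d0 c4t4d0 c4t5d0 \
      raidz2 c0t6d0 c0t7d0 c1t6d0 c1t7d0 c2t6d0 c2t7d0 c3t6d0 c3t7d0 c4t6d0 c4t7d0 \
      spare c5t0d0 c5t1d0 c5t2d0 c5t3d0 c5t4d0 c5t5d0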

4. Each of the six iSCSI nodes presents the entirety of its 28TB pool
through a CHAP-authenticated iSCSI target. (See
http://docs.sun.com/app/docs/doc/819-5461/gechv?a=view for more info on
that process.)
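
In broad strokes (the Sun doc above has the authoritative steps; the
target, initiator, and dataset names below are placeholders), each node
carves a zvol out of its pool, presents it as a target, and only accepts
the head node's CHAP-authenticated initiator:

  # Create a zvol spanning the pool and back an iSCSI target with it
  zfs create -V 28T data-das/headvol
  iscsitadm create target -b /dev/zvol/rdsk/data-das/headvol thumper1-tgt

  # Register the head node's initiator, give it a CHAP name/secret, and
  # restrict the target to that initiator
  iscsitadm create initiator --iqn iqn.1986-03.com.sun:01:headnode headnode
  iscsitadm modify initiator --chap-name headnode headnode
  iscsitadm modify initiator --chap-secret headnode
  iscsitadm modify target --acl headnode thumper1-tgt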

5. The NAS head node has wrangled up all six of the iSCSI targets (using
"iscsiadm add discovery-address ...") and joined them to create ~150TB of
usable storage (using "zpool create" against the devices created by
iscsiadm). With that, we've been able to carve up the storage into
multiple ZFS filesystems, each with its own recordsize, quota,
permissions, NFS/CIFS shares, etc.
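
And the head node side goes roughly like this (addresses, CHAP details,
and dataset names are placeholders, and the real iSCSI device names are
far uglier than the ones shown):

  # Authenticate with CHAP and discover all six targets
  iscsiadm modify initiator-node --authentication CHAP
  iscsiadm modify initiator-node --CHAP-secret
  iscsiadm add discovery-address 192.168.10.101:3260   # ...repeat per Thumper
  iscsiadm modify discovery --sendtargets enable
  devfsadm -i iscsi

  # Stripe the six iSCSI LUNs into one big pool, then carve out
  # filesystems tuned per use
  zpool create data-iscsi c2t1d0 c3t1d0 c4t1d0 c5t1d0 c6t1d0 c7t1d0
  zfs create data-iscsi/research
  zfs set recordsize=128k data-iscsi/research
  zfs set quota=20T data-iscsi/research
  zfs set sharenfs=on data-iscsi/research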

I think that about covers the high-level stuff. If there's any area you want
to dive deeper into, fire away!

-Gray

On Wed, Oct 15, 2008 at 1:29 AM, Brent Jones <[EMAIL PROTECTED]> wrote:

> On Tue, Oct 14, 2008 at 12:31 AM, Gray Carper <[EMAIL PROTECTED]> wrote:
> > Hey, all!
> >
> > We've recently used six x4500 Thumpers, all publishing ~28TB iSCSI
> targets over IP-multipathed 10Gb Ethernet, to build a ~150TB ZFS pool on an
> x4200 head node. In trying to discover optimal ZFS pool construction
> settings, we've run a number of iozone tests, so I thought I'd share them
> with you and see if you have any comments, suggestions, etc.
> >
> > First, on a single Thumper, we ran baseline tests on the direct-attached
> storage (which is collected into a single ZFS pool comprised of four raidz2
> groups)...
> >
> > [1GB file size, 1KB record size]
> > Command: iozone -i 0 -i 1 -i 2 -r 1k -s 1g -f /data-das/perftest/1gbtest
> > Write: 123919
> > Rewrite: 146277
> > Read: 383226
> > Reread: 383567
> > Random Read: 84369
> > Random Write: 121617
> >
> > [8GB file size, 512KB record size]
> > Command:
> > Write:  373345
> > Rewrite:  665847
> > Read:  2261103
> > Reread:  2175696
> > Random Read:  2239877
> > Random Write:  666769
> >
> > [64GB file size, 1MB record size]
> > Command: iozone -i 0 -i 1 -i 2 -r 1m -s 64g -f
> /data-das/perftest/64gbtest
> > Write: 517092
> > Rewrite: 541768
> > Read: 682713
> > Reread: 697875
> > Random Read: 89362
> > Random Write: 488944
> >
> > These results look very nice, though you'll notice that the random read
> numbers tend to be pretty low on the 1GB and 64GB tests (relative to their
> sequential counterparts), but the 8GB random (and sequential) read is
> unbelievably good.
> >
> > Now we move to the head node's iSCSI aggregate ZFS pool...
> >
> > [1GB file size, 1KB record size]
> > Command: iozone -i 0 -i 1 -i 2 -r 1k -s 1g -f
> /volumes/data-iscsi/perftest/1gbtest
> > Write:  127108
> > Rewrite:  120704
> > Read:  394073
> > Reread:  396607
> > Random Read:  63820
> > Random Write:  5907
> >
> > [8GB file size, 512KB record size]
> > Command: iozone -i 0 -i 1 -i 2 -r 512 -s 8g -f
> /volumes/data-iscsi/perftest/8gbtest
> > Write:  235348
> > Rewrite:  179740
> > Read:  577315
> > Reread:  662253
> > Random Read:  249853
> > Random Write:  274589
> >
> > [64GB file size, 1MB record size]
> > Command: iozone -i 0 -i 1 -i 2 -r 1m -s 64g -f
> /volumes/data-iscsi/perftest/64gbtest
> > Write:  190535
> > Rewrite:  194738
> > Read:  297605
> > Reread:  314829
> > Random Read:  93102
> > Random Write:  175688
> >
> > Generally speaking, the results look good, but you'll notice that random
> writes are atrocious on the 1GB tests and random reads are not so great on
> the 1GB and 64GB tests, but the 8GB test looks great across the board.
> Voodoo! ;> Incidentally, I ran all these tests against the ZFS pool in disk,
> raidz1, and raidz2 modes - there were no significant changes in the results.
> >
> > So, how concerned should we be about the low scores here and there? Any
> suggestions on how to improve our configuration? And how excited should we
> be about the 8GB tests? ;>
> >
> > Thanks so much for any input you have!
> > -Gray
> > ---
> > University of Michigan
> > Medical School Information Services
> > --
> >
>
> Your setup sounds very interesting, particularly how you export iSCSI to
> another head unit. Can you give me some more details on your file system
> layout, and how you mount it on the head unit?
> Sounds like a pretty clever way to export awesomely large volumes!
>
> Regards,
>
> --
> Brent Jones
> [EMAIL PROTECTED]
>



-- 
Gray Carper
MSIS Technical Services
University of Michigan Medical School
[EMAIL PROTECTED]  |  skype:  graycarper  |  734.418.8506
http://www.umms.med.umich.edu/msis/