Hello Matthew,

Wednesday, May 31, 2006, 8:09:08 PM, you wrote:

MA> There are a few related questions that I think you want answered.

MA> 1. How does RAID-Z affect performance?

MA> When using RAID-Z, each filesystem block is spread across (typically)
MA> all disks in the raid-z group.  So to a first approximation, each raid-z
MA> group provides the iops of a single disk (but the bandwidth of N-1
MA> disks).  See Roch's excellent article for a detailed explanation:

MA> http://blogs.sun.com/roller/page/roch?entry=when_to_and_not_to
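
MA> As a rough back-of-the-envelope illustration (the numbers are made
MA> up): a 6-disk raid-z group built from disks that each deliver ~150
MA> random iops will still deliver only ~150 random read iops for the
MA> group as a whole, since every block read touches every disk; its
MA> streaming bandwidth, however, is roughly 5x that of a single disk.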

MA> 2. Why does 'zpool iostat' not report actual i/os?
MA> 3. Why are we doing so many read i/os?
MA> 4. Why are we reading so much data?

MA> 'zpool iostat' reports the i/os that are seen at each level of the vdev
MA> tree, rather than the sum of the i/os that occur below that point in the
MA> vdev tree.  This can provide some additional information when diagnosing
MA> performance problems.  However, it is a bit counter-intuitive, so I
MA> always use iostat(1m).  It may be clunky, but it does report on the
MA> actual i/os issued to the hardware.  Also, I really like having the
MA> %busy reading, which 'zpool iostat' does not provide.
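
MA> For example, to compare the two views side by side (the pool name
MA> "tank" is just a placeholder), you could run something like:
MA>
MA>   zpool iostat -v tank 5     # per-vdev view, as seen by ZFS
MA>   iostat -xn 5               # actual i/os issued to the hardware,
MA>                              # with service times and %b (busy)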

MA> We are doing lots of read i/os because each block is spread out across
MA> all the disks in a raid-z group (as mentioned in Roch's article).
MA> However, the "vdev cache" causes us to issue many *fewer* i/os than
MA> would seem to be required, but to read much *more* data.

MA> For example, say we need to read a block of data.  We'll send the read
MA> down to the raid-z vdev.  The raid-z vdev knows that the data is
MA> spread out over its disks, so it (essentially) issues one read zio_t to
MA> each of the disk vdevs to retrieve the data.  Now each of those disk
MA> vdevs will first look in its vdev cache.  If it finds the data there, it
MA> returns it without ever actually issuing an i/o to the hardware.  If it
MA> doesn't find it there, it will issue a 64k i/o to the hardware, and put
MA> that 64k chunk into its vdev cache.

MA> Without the vdev cache, we would simply issue (Number of blocks to read)
MA> * (Number of disks in each raid-z vdev) read i/os to the hardware, and
MA> read the total number of bytes that you would expect, since each of
MA> those i/os would be for (approximately) 1/Ndisks of the block size.
MA> However, with the vdev cache, we will issue fewer i/os, but read
MA> more data.
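
MA> To put illustrative numbers on it (made up for the example): say we
MA> read a 16k block from a 6-disk raid-z group.  Each of the ~5 data
MA> disks holds only ~3k of that block, so without the vdev cache we
MA> would issue ~5 small reads totalling ~16k.  With the vdev cache,
MA> each of those small reads that misses the cache is inflated to a
MA> 64k read, so the same block can cost up to 5 x 64k = 320k read from
MA> the hardware -- but later reads of nearby data may then be served
MA> from the cache without issuing any i/o at all.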

MA> 5. How can performance be improved?

MA> A. Use one big pool.
MA> Having 6 pools causes performance (and storage) to be stranded.  When
MA> one filesystem is busier than the others, it can only use the bandwidth
MA> and iops of its single raid-z vdev.  If you had one big pool, that
MA> filesystem would be able to use all the disks in your system.

MA> B. Use smaller raid-z stripes.
MA> As Roch's article explains, smaller raid-z stripes will provide more
MA> iops.  We generally suggest 3 to 9 disks in each raid-z stripe.
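
MA> For instance (again, illustrative numbers): 12 disks arranged as a
MA> single 12-wide raid-z stripe give you roughly the random read iops
MA> of one disk, while the same 12 disks arranged as two 6-wide raid-z
MA> stripes in one pool give you roughly the iops of two disks, at the
MA> cost of one more disk's worth of capacity going to parity.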

MA> C. Use higher-performance disks.  
MA> I'm not sure what underlying storage you're using, but it's
MA> pretty slow!  As you can see from your per-disk iostat output, each
MA> device is only capable of 50-100 iops or 1-4MB/s, and takes on average
MA> over 100ms to service a request.  If you are using some sort of hardware
MA> RAID enclosure, it may be working against you here.  The preferred
MA> configuration would be to have each disk appear as a single device to
MA> the system.  (This should be possible even with fancy RAID hardware.)

MA> So in conclusion, you can improve performance by creating one big pool
MA> with several raid-z stripes, each with 3 to 9 disks in it.  These disks
MA> should be actual physical disks.
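
MA> As a sketch (the pool name and device names are just placeholders
MA> for your actual disks), that could look something like:
MA>
MA>   zpool create tank \
MA>       raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 \
MA>       raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 \
MA>       raidz c3t0d0 c3t1d0 c3t2d0 c3t3d0 c3t4d0 c3t5d0
MA>
MA> That gives one pool with three 6-disk raid-z stripes, so any busy
MA> filesystem can draw on the iops and bandwidth of all three stripes.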

MA> Hope this helps,

That helps a lot - thank you.
I wish I had known this before... The information Roch put on his blog
should be explained in both the man pages and the ZFS Admin Guide, as
this is behavior one would not expect.

It actually means raid-z is useless in many environments compared to
traditional raid-5.

Now I use 3510 JBODs connected on two loops with MPxIO.
The disks are 73GB 15K drives, so they should be quite fast.

Now I have to figure out how to move away from raid-z...

-- 
Best regards,
 Robert                            mailto:[EMAIL PROTECTED]
                                       http://milek.blogspot.com
