Hello Matthew,

Wednesday, May 31, 2006, 8:09:08 PM, you wrote:
MA> There are a few related questions that I think you want answered.
MA>
MA> 1. How does RAID-Z affect performance?
MA>
MA> When using RAID-Z, each filesystem block is spread across (typically)
MA> all disks in the raid-z group. So to a first approximation, each raid-z
MA> group provides the iops of a single disk (but the bandwidth of N-1
MA> disks). See Roch's excellent article for a detailed explanation:
MA>
MA> http://blogs.sun.com/roller/page/roch?entry=when_to_and_not_to
MA>
MA> 2. Why does 'zpool iostat' not report actual i/os?
MA> 3. Why are we doing so many read i/os?
MA> 4. Why are we reading so much data?
MA>
MA> 'zpool iostat' reports the i/os that are seen at each level of the vdev
MA> tree, rather than the sum of the i/os that occur below that point in the
MA> vdev tree. This can provide some additional information when diagnosing
MA> performance problems. However, it is a bit counter-intuitive, so I
MA> always use iostat(1m). It may be clunky, but it does report on the
MA> actual i/os issued to the hardware. Also, I really like having the
MA> %busy reading, which 'zpool iostat' does not provide.
MA>
MA> We are doing lots of read i/os because each block is spread out across
MA> all the disks in a raid-z group (as mentioned in Roch's article).
MA> However, the "vdev cache" is causing us to issue many *fewer* i/os than
MA> would seem to be required, but to read much *more* data.
MA>
MA> For example, say we need to read a block of data. We'll send the read
MA> down to the raid-z vdev. The raid-z vdev knows that the data is
MA> spread out over its disks, so it (essentially) issues one read zio_t to
MA> each of the disk vdevs to retrieve the data. Now each of those disk
MA> vdevs will first look in its vdev cache. If it finds the data there, it
MA> returns it without ever actually issuing an i/o to the hardware. If it
MA> doesn't find it there, it will issue a 64k i/o to the hardware, and put
MA> that 64k chunk into its vdev cache.
MA>
MA> Without the vdev cache, we would simply issue (number of blocks to read)
MA> * (number of disks in each raid-z vdev) read i/os to the hardware, and
MA> read the total number of bytes that you would expect, since each of
MA> those i/os would be for (approximately) 1/Ndisk bytes. However, with
MA> the vdev cache, we issue fewer i/os but read more data.
MA>
MA> 5. How can performance be improved?
MA>
MA> A. Use one big pool.
MA>
MA> Having 6 pools causes performance (and storage) to be stranded. When
MA> one filesystem is busier than the others, it can only use the bandwidth
MA> and iops of its single raid-z vdev. If you had one big pool, that
MA> filesystem would be able to use all the disks in your system.
MA>
MA> B. Use smaller raid-z stripes.
MA>
MA> As Roch's article explains, smaller raid-z stripes will provide more
MA> iops. We generally suggest 3 to 9 disks in each raid-z stripe.
MA>
MA> C. Use higher-performance disks.
MA>
MA> I'm not sure what the underlying storage you're using is, but it's
MA> pretty slow! As you can see from your per-disk iostat output, each
MA> device is only capable of 50-100 iops or 1-4MB/s, and takes on average
MA> over 100ms to service a request. If you are using some sort of hardware
MA> RAID enclosure, it may be working against you here. The preferred
MA> configuration would be to have each disk appear as a single device to
MA> the system. (This should be possible even with fancy RAID hardware.)
MA>
MA> So in conclusion, you can improve performance by creating one big pool
MA> with several raid-z stripes, each with 3 to 9 disks in it. These disks
MA> should be actual physical disks.
MA>
MA> Hope this helps,
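
For concreteness, the layout suggested above - one pool built from a few
small raid-z stripes - is a single 'zpool create' invocation. The pool
name and cXtYd0 device names below are only placeholders, not the actual
disks on the 3510 loops:

    # one pool, three 5-disk raid-z stripes (placeholder device names)
    zpool create tank \
        raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 \
        raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 \
        raidz c3t0d0 c3t1d0 c3t2d0 c3t3d0 c3t4d0

    # per-device i/os actually issued to the hardware, with %busy (%b)
    iostat -xnz 5

ZFS stripes writes dynamically across all of a pool's top-level vdevs, so
a busy filesystem in that pool can draw on the iops of all three raid-z
stripes instead of being pinned to a single one.
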

That helps a lot - thank you. I wish I had known it before...

Information Roch put on his blog should be explained both in the man
pages and in the ZFS Admin Guide, as this is something one would not
expect. It actually means raid-z is useless in many environments
compared to traditional raid-5.

I use 3510 JBODs connected on two loops with MPxIO. The disks are 73GB
15K, so they should be quite fast. Now I have to find out how to move
away from raid-z...

--
Best regards,
 Robert                          mailto:[EMAIL PROTECTED]
                                 http://milek.blogspot.com

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss