There are a few related questions that I think you want answered.

1. How does RAID-Z affect performance?
When using RAID-Z, each filesystem block is spread across (typically) all disks in the raid-z group. So to a first approximation, each raid-z group provides the iops of a single disk (but the bandwidth of N-1 disks). See Roch's excellent article for a detailed explanation: http://blogs.sun.com/roller/page/roch?entry=when_to_and_not_to

2. Why does 'zpool iostat' not report actual i/os?
3. Why are we doing so many read i/os?
4. Why are we reading so much data?

'zpool iostat' reports the i/os that are seen at each level of the vdev tree, rather than the sum of the i/os that occur below that point in the tree. This can provide some additional information when diagnosing performance problems, but it is a bit counter-intuitive, so I always use iostat(1m). It may be clunky, but it does report the actual i/os issued to the hardware. Also, I really like having the %busy reading, which 'zpool iostat' does not provide.

We are doing lots of read i/os because each block is spread out across all the disks in a raid-z group (as mentioned in Roch's article). However, the "vdev cache" causes us to issue many *fewer* i/os than would seem to be required, while reading much *more* data.

For example, say we need to read a block of data. We'll send the read down to the raid-z vdev. The raid-z vdev knows that the data is spread out over its disks, so it (essentially) issues one read zio_t to each of the disk vdevs to retrieve the data. Now each of those disk vdevs will first look in its vdev cache. If it finds the data there, it returns it without ever actually issuing an i/o to the hardware. If it doesn't find it there, it will issue a 64k i/o to the hardware and put that 64k chunk into its vdev cache.

Without the vdev cache, we would simply issue (number of blocks to read) * (number of disks in each raid-z vdev) read i/os to the hardware, and read the total number of bytes that you would expect, since each of those i/os would be for (approximately) 1/Ndisks of the block. With the vdev cache, however, we issue fewer i/os but read more data; the small arithmetic sketch near the end of this message puts rough numbers on this.

5. How can performance be improved?

A. Use one big pool. Having 6 pools causes performance (and storage) to be stranded. When one filesystem is busier than the others, it can only use the bandwidth and iops of its single raid-z vdev. If you had one big pool, that filesystem would be able to use all the disks in your system.

B. Use smaller raid-z stripes. As Roch's article explains, smaller raid-z stripes provide more iops. We generally suggest 3 to 9 disks in each raid-z stripe.

C. Use higher-performance disks. I'm not sure what underlying storage you're using, but it's pretty slow! As you can see from your per-disk iostat output, each device is only capable of 50-100 iops or 1-4MB/s, and takes on average over 100ms to service a request. If you are using some sort of hardware RAID enclosure, it may be working against you here. The preferred configuration would be to have each disk appear as a single device to the system. (This should be possible even with fancy RAID hardware.)

So in conclusion, you can improve performance by creating one big pool with several raid-z stripes, each with 3 to 9 disks in it. These disks should be actual physical disks. The 'zpool create' sketch just below shows roughly what that layout looks like.
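For example, assuming the disks show up to the OS as plain devices, a layout along these lines is what I have in mind. This is only a sketch; the pool name and the cXtYdZ device names are placeholders, not your actual devices:

  # one pool built from three 6-disk raid-z stripes (placeholder names)
  zpool create tank \
      raidz c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0 \
      raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 \
      raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0

Each 'raidz' keyword starts a new stripe, so this gives one pool with three 6-disk raid-z vdevs. ZFS spreads the load across all of them, so a busy filesystem is no longer limited to the iops of a single stripe.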
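And to put rough numbers on the read inflation from questions 3 and 4, here is a minimal back-of-the-envelope sketch. The 128K block size is an assumption chosen for illustration, and the 12-wide raid-z group is taken from your 'zpool iostat -v' output below; neither is meant as a statement about your exact configuration:

  #!/usr/bin/ksh
  # Worst case for one cold read: every disk vdev misses its vdev cache
  # and reads a full 64K chunk from the hardware.
  blocksize=131072     # one 128K filesystem block (assumed)
  ndisks=12            # disks in the raid-z group (from the output below)
  vdev_read=65536      # bytes read from disk per vdev cache miss
  physical=$((ndisks * vdev_read))
  echo "logical read:  $blocksize bytes"
  echo "physical read: $physical bytes"
  echo "inflation:     $((physical / blocksize))x"

That works out to about 6x for this case; smaller blocks inflate even more, while vdev cache hits pull the ratio back down, so a measured ratio in the neighborhood of 10x would not be surprising.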
Hope this helps,

--matt

ps. I'm drawing my conclusions based on the following data that you provided:

On Wed, May 31, 2006 at 08:26:10AM -0700, Robert Milkowski wrote:
> That's interesting - 'zpool iostat' shows quite small read volume to
> any pool however if I run 'zpool iostat -v' then I can see that while
> read volume to a pool is small, read volume to each disk is actually
> quite large so in summary I get over 10x read volume if I sum all
> disks in a pool than on pool itself. These data are consistent with
> iostat. So now even zpool claims that it actually issues 10x (and
> more) read volume to all disks in a pool than to pool itself.
>
> Now - why???? It really hits performance here...
> The problem is that I do not see any heavy traffic on network
> interfaces nor using zpool iostat. However using just old iostat I can
> see MUCH more traffic going to local disks. The difference is someting
> like 10x - zpool iostat shows for example ~6MB/s of reads however
> iostat shows ~50MB/s. The question is who's lying?
>
>     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
>    57.6    2.0 3752.7   35.4  0.0  8.9    0.0  149.6   0  82 c4t...90E750d0
>    43.5    2.0 2846.9   35.4  0.0  5.3    0.0  116.4   0  66 c4t...90E7F0d0
>    56.6    2.0 3752.7   35.4  0.0  8.4    0.0  144.0   0  81 c4t...90E730d0
>    29.3   44.5 1558.9  212.8  0.0  5.0    0.0   67.1   0  96 c4t...92B540d0
>     9.1   65.7  582.3  313.9  0.0 11.2    0.0  149.4   0 100 c4t...8EDB20d0
>     0.0   78.9    0.0   75.3  0.0 35.0    0.0  443.8   0 100 c4t...9495A0d0
>
> bash-3.00# fsstat zfs 1
> [...]
>  new  name  name  attr  attr lookup rddir  read  read write write
> file remov  chng   get   set    ops   ops   ops bytes   ops bytes
>   10    12     8   919     7    102     0    32  975K    26  652K zfs
>    6    21    10 1.22K     1    123     0   205 6.23M     4 33.5K zfs
>   14    26     3 1.14K     9    127     0    46 1.33M     5 60.1K zfs
>   13    11     8 1.02K     7    102     0    43 1.24M    22  514K zfs
>   10    17    10   998     6     87     0    31  746K    85 2.45M zfs
>   11    15     3   915    24     93     0    60 1.86M     6 54.3K zfs
>    7    31    19 1.82K     5    167     0    23  636K   278 8.22M zfs
>   14    22    13 1.44K    10    104     0    31  992K   257 7.84M zfs
>    5    18     5 1.16K     4     80     0    26  764K   262 8.06M zfs
>    1    19     6   572     2     75     0    19  579K     3 20.6K zfs
>
> bash-3.00# zpool iostat -v p1 1
>                              capacity     operations    bandwidth
> pool                       used  avail   read  write   read  write
> -------------------------  -----  -----  -----  -----  -----  -----
> p1                          749G  67.2G     58     90   878K   903K
>   raidz                     749G  67.2G     58     90   878K   903K
>     c4t500000E011909320d0      -      -     15     40   959K  87.3K
>     c4t500000E011909300d0      -      -     14     40   929K  86.5K
>     c4t500000E011903030d0      -      -     18     40  1.11M  86.8K
>     c4t500000E011903300d0      -      -     13     32   823K  77.7K
>     c4t500000E0119091E0d0      -      -     15     40   961K  87.3K
>     c4t500000E0119032D0d0      -      -     14     40   930K  86.5K
>     c4t500000E011903370d0      -      -     18     40  1.11M  86.8K
>     c4t500000E011903190d0      -      -     13     32   828K  77.8K
>     c4t500000E011903350d0      -      -     15     40   964K  87.3K
>     c4t500000E0119095A0d0      -      -     14     40   934K  86.5K
>     c4t500000E0119032A0d0      -      -     18     40  1.11M  86.8K
>     c4t500000E011903340d0      -      -     13     32   821K  77.7K
> -------------------------  -----  -----  -----  -----  -----  -----

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss