There are a few related questions that I think you want answered.

1. How does RAID-Z affect performance?

When using RAID-Z, each filesystem block is spread across (typically)
all disks in the raid-z group.  So to a first approximation, each raid-z
group provides the iops of a single disk (but the bandwidth of N-1
disks).  See Roch's excellent article for a detailed explanation:

http://blogs.sun.com/roller/page/roch?entry=when_to_and_not_to
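To make the rule of thumb concrete, here is a toy back-of-the-envelope
model of it in Python.  The disk numbers below are illustrative
assumptions, not measurements from your system:

```python
# Toy first-approximation model of raid-z group performance, following
# the rule of thumb above: each raid-z group delivers roughly the
# random-read iops of ONE disk, but the streaming bandwidth of N-1
# disks (one disk's worth of bandwidth goes to parity).

def raidz_group_perf(n_disks, disk_iops, disk_mb_s):
    """Return (approx_iops, approx_mb_s) for one raid-z group."""
    assert n_disks >= 2, "raid-z needs at least 2 disks"
    return disk_iops, (n_disks - 1) * disk_mb_s

# Twelve 100-iops, 4 MB/s disks in a single big raid-z group:
iops, mb_s = raidz_group_perf(12, disk_iops=100, disk_mb_s=4)
print(iops, mb_s)               # one group: ~100 iops, ~44 MB/s

# The same twelve disks arranged as four 3-disk raid-z groups:
total_iops = 4 * raidz_group_perf(3, disk_iops=100, disk_mb_s=4)[0]
print(total_iops)               # four groups: ~400 iops
```

This is why, all else being equal, several small raid-z groups give you
more random-read iops than one wide group made of the same disks.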

2. Why does 'zpool iostat' not report actual i/os?
3. Why are we doing so many read i/os?
4. Why are we reading so much data?

'zpool iostat' reports the i/os that are seen at each level of the vdev
tree, rather than the sum of the i/os that occur below that point in the
vdev tree.  This can provide some additional information when diagnosing
performance problems.  However, it is a bit counter-intuitive, so I
always use iostat(1M).  It may be clunky, but it does report the
actual i/os issued to the hardware.  Also, I really like having the
%busy reading, which 'zpool iostat' does not provide.

We are doing lots of read i/os because each block is spread out across
all the disks in a raid-z group (as mentioned in Roch's article).
However, the "vdev cache" is causing us to issue many *fewer* i/os than
would seem to be required, but to read much *more* data.

For example, say we need to read a block of data.  We'll send the read
down to the raid-z vdev.  The raid-z vdev knows that the data is
spread across its disks, so it (essentially) issues one read zio_t to
each of the disk vdevs to retrieve the data.  Now each of those disk
vdevs will first look in its vdev cache.  If it finds the data there, it
returns it without ever actually issuing an i/o to the hardware.  If it
doesn't find it there, it will issue a 64k i/o to the hardware, and put
that 64k chunk into its vdev cache.
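The per-disk read path described above can be sketched in a few lines
of Python.  This is illustrative only; the real vdev cache keeps a
small LRU of chunks and has eviction, alignment, and size-cutoff logic
not shown here, and for simplicity this sketch assumes each read fits
within one 64k chunk:

```python
CHUNK = 64 * 1024          # the vdev cache reads in 64k chunks

class DiskVdev:
    def __init__(self):
        self.cache = {}    # chunk-aligned offset -> 64k of data

    def read_from_hardware(self, offset, size):
        # Stand-in for an actual disk i/o; returns dummy data.
        return b"\0" * size

    def read(self, offset, size):
        base = offset - offset % CHUNK
        if base not in self.cache:
            # Cache miss: issue ONE 64k i/o for the whole chunk,
            # even though the caller asked for far fewer bytes.
            self.cache[base] = self.read_from_hardware(base, CHUNK)
        chunk = self.cache[base]
        return chunk[offset - base : offset - base + size]
```

A second read that lands in the same 64k chunk is satisfied entirely
from the cache, with no i/o to the hardware at all.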

Without the vdev cache, we would simply issue (Number of blocks to read)
* (Number of disks in each raid-z vdev) read i/os to the hardware, and
read the total number of bytes that you would expect, since each of
those i/os would be for (approximately) 1/Ndisk bytes.  However, with
the vdev cache, we will issue fewer i/os, but read more data.
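Here is that arithmetic worked through with some illustrative numbers.
The block size, disk count, and hit rate below are assumptions I picked
for the example, not figures from your system:

```python
# Back-of-the-envelope for the "fewer i/os, more bytes" effect.
# Assumed: a 12-disk raid-z group, 16k filesystem blocks, so each
# disk's share of one block is roughly 16k/12, i.e. ~1.3k.

N_DISKS = 12
BLOCK   = 16 * 1024
SHARE   = BLOCK // N_DISKS       # ~1365 bytes per disk per block
CHUNK   = 64 * 1024              # vdev cache chunk size
BLOCKS  = 1000                   # blocks the filesystem reads

# Without the vdev cache: one small i/o per disk per block.
ios_plain   = BLOCKS * N_DISKS
bytes_plain = ios_plain * SHARE  # roughly BLOCKS * BLOCK total

# With the vdev cache, suppose (hypothetically) half the lookups hit
# the cache; every miss pulls in a full 64k chunk from the hardware.
hit_rate    = 0.5
ios_cache   = int(ios_plain * (1 - hit_rate))
bytes_cache = ios_cache * CHUNK

print(ios_plain, bytes_plain)    # 12000 i/os, ~16 MB
print(ios_cache, bytes_cache)    # 6000 i/os, ~393 MB
```

Half the i/os, but roughly 24x the bytes read from the hardware, which
is the same shape of inflation you are seeing between fsstat/'zpool
iostat' at the pool level and iostat at the device level.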

5. How can performance be improved?

A. Use one big pool.
Having 6 pools causes performance (and storage) to be stranded.  When
one filesystem is busier than the others, it can only use the bandwidth
and iops of its single raid-z vdev.  If you had one big pool, that
filesystem would be able to use all the disks in your system.

B. Use smaller raid-z stripes.
As Roch's article explains, smaller raid-z stripes will provide more
iops.  We generally suggest 3 to 9 disks in each raid-z stripe.
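Suggestions A and B combine naturally.  A hypothetical layout (the
device names below are placeholders, not your actual devices) would
look like:

```shell
# One pool built from several small raid-z stripes.  Each "raidz"
# keyword starts a new stripe; ZFS spreads writes across all three
# stripes, so any one busy filesystem can use the iops of every
# stripe rather than being confined to a single vdev.
zpool create tank \
    raidz c4t0d0 c4t1d0 c4t2d0 c4t3d0 \
    raidz c4t4d0 c4t5d0 c4t6d0 c4t7d0 \
    raidz c5t0d0 c5t1d0 c5t2d0 c5t3d0
```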

C. Use higher-performance disks.  
I'm not sure what the underlying storage you're using is, but it's
pretty slow!  As you can see from your per-disk iostat output, each
device is only capable of 50-100 iops or 1-4MB/s, and takes on average
over 100ms to service a request.  If you are using some sort of hardware
RAID enclosure, it may be working against you here.  The preferred
configuration would be to have each disk appear as a single device to
the system.  (This should be possible even with fancy RAID hardware.)

So in conclusion, you can improve performance by creating one big pool
with several raid-z stripes, each with 3 to 9 disks in it.  These disks
should be actual physical disks.

Hope this helps,
--matt

ps. I'm drawing my conclusions based on the following data that you
provided:

On Wed, May 31, 2006 at 08:26:10AM -0700, Robert Milkowski wrote:
> That's interesting - 'zpool iostat' shows quite small read volume to
> any pool however if I run 'zpool iostat -v' then I can see that while
> read volume to a pool is small, read volume to each disk is actually
> quite large so in summary I get over 10x read volume if I sum all
> disks in a pool than on pool itself. These data are consistent with
> iostat. So now even zpool claims that it actually issues 10x (and
> more) read volume to all disks in a pool than to pool itself.
>
> Now - why???? It really hits performance here...

> The problem is that I do not see any heavy traffic on network
> interfaces nor using zpool iostat. However using just old iostat I can
> see MUCH more traffic going to local disks. The difference is someting
> like 10x - zpool iostat shows for example ~6MB/s of reads however
> iostat shows ~50MB/s. The question is who's lying?

>     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
>    57.6    2.0 3752.7   35.4  0.0  8.9    0.0  149.6   0  82 c4t...90E750d0
>    43.5    2.0 2846.9   35.4  0.0  5.3    0.0  116.4   0  66 c4t...90E7F0d0
>    56.6    2.0 3752.7   35.4  0.0  8.4    0.0  144.0   0  81 c4t...90E730d0
>    29.3   44.5 1558.9  212.8  0.0  5.0    0.0   67.1   0  96 c4t...92B540d0
>     9.1   65.7  582.3  313.9  0.0 11.2    0.0  149.4   0 100 c4t...8EDB20d0
>     0.0   78.9    0.0   75.3  0.0 35.0    0.0  443.8   0 100 c4t...9495A0d0
  
> bash-3.00# fsstat zfs 1
> [...]
>  new  name   name  attr  attr lookup rddir  read read  write write
>  file remov  chng   get   set    ops   ops   ops bytes   ops bytes
>    10    12     8   919     7    102     0    32  975K    26  652K zfs
>     6    21    10 1.22K     1    123     0   205 6.23M     4 33.5K zfs
>    14    26     3 1.14K     9    127     0    46 1.33M     5 60.1K zfs
>    13    11     8 1.02K     7    102     0    43 1.24M    22  514K zfs
>    10    17    10   998     6     87     0    31  746K    85 2.45M zfs
>    11    15     3   915    24     93     0    60 1.86M     6 54.3K zfs
>     7    31    19 1.82K     5    167     0    23  636K   278 8.22M zfs
>    14    22    13 1.44K    10    104     0    31  992K   257 7.84M zfs
>     5    18     5 1.16K     4     80     0    26  764K   262 8.06M zfs
>     1    19     6   572     2     75     0    19  579K     3 20.6K zfs

> bash-3.00# zpool iostat -v p1 1
>                               capacity     operations    bandwidth
> pool                        used  avail   read  write   read  write
> -------------------------  -----  -----  -----  -----  -----  -----
> p1                          749G  67.2G     58     90   878K   903K
>   raidz                     749G  67.2G     58     90   878K   903K
>     c4t500000E011909320d0      -      -     15     40   959K  87.3K
>     c4t500000E011909300d0      -      -     14     40   929K  86.5K
>     c4t500000E011903030d0      -      -     18     40  1.11M  86.8K
>     c4t500000E011903300d0      -      -     13     32   823K  77.7K
>     c4t500000E0119091E0d0      -      -     15     40   961K  87.3K
>     c4t500000E0119032D0d0      -      -     14     40   930K  86.5K
>     c4t500000E011903370d0      -      -     18     40  1.11M  86.8K
>     c4t500000E011903190d0      -      -     13     32   828K  77.8K
>     c4t500000E011903350d0      -      -     15     40   964K  87.3K
>     c4t500000E0119095A0d0      -      -     14     40   934K  86.5K
>     c4t500000E0119032A0d0      -      -     18     40  1.11M  86.8K
>     c4t500000E011903340d0      -      -     13     32   821K  77.7K
> -------------------------  -----  -----  -----  -----  -----  -----
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
