Jeffrey W. Baker wrote:
> # zfs set recordsize=2K tank/bench
> # randomio bigfile 10 .25 .01 2048 60 1
>
>   total |  read:          latency (ms)      |  write:         latency (ms)
>    iops |   iops   min    avg    max   sdev |   iops   min    avg    max   sdev
> --------+-----------------------------------+----------------------------------
>   463.9 |  346.8   0.0   21.6  761.9   33.7 |  117.1   0.0   21.3  883.9   33.5
>
> Roughly the same as when the RS was 128K.  But, if I set the RS to 2K
> before creating bigfile:
>
>   total |  read:          latency (ms)      |  write:         latency (ms)
>    iops |   iops   min    avg    max   sdev |   iops   min    avg    max   sdev
> --------+-----------------------------------+----------------------------------
>   614.7 |  460.4   0.0   18.5  249.3   14.2 |  154.4   0.0    9.6  989.0   27.6
>
> Much better!  Yay!  So I assume you would always set RS=8K when using
> PostgreSQL, etc?
I presume these are something like Seagate DB35.3 series SATA 400 GByte
drives?  If so, the spec'ed average read seek time is < 11 ms and the
rotational speed is 7,200 rpm, so the theoretical peak random read rate
per drive is ~66 iops.
http://www.seagate.com/ww/v/index.jsp?vgnextoid=01117ea70fafd010VgnVCM100000dd04090aRCRD&locale=en-US#

For an 8-disk mirrored set, the max theoretical random read rate is 527
iops.  I see you're getting 460, so you're at 87% of theoretical.  Not bad.

When writing, the max theoretical rate is a little smaller because of the
longer seek time (see the datasheet), so we get ~62 iops per disk.  Also,
the total is cut in half because we have to write to both sides of the
mirror, so the peak is 248 iops.  You see 154, or 62% of peak.  Not quite
so good.

But there is another behaviour here which is peculiar to ZFS.  All writes
are COW and allocated from free space, and that allocation is done in
1 MByte chunks.  For 2 kByte I/Os, that means you need to reach a very
high write rate before the workload is spread out across all of the disks
simultaneously.  You should be able to see this if you watch iostat with a
small interval.  With an 8 kByte recordsize it is easier to spread the
wealth across all 4 mirrored pairs.  On other RAID systems you can vary
the stripe interlace, usually to much smaller values, to help spread the
wealth.  It is difficult to predict how this will affect your application
performance, though.

For simultaneous reads and writes, 614 iops is pretty decent, but it makes
me wonder whether the spread is much smaller than the full disk.  If the
application only does 8 kByte iops, then I wouldn't even bother doing
large, sequential workload testing... you'll never be able to approach
that limit before you run out of some other resource, usually CPU or the
controller.
 -- richard
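For anyone who wants to check the arithmetic, here is a minimal sketch in
Python.  The 11 ms read seek and 7,200 rpm are the figures discussed above;
the 12 ms write seek is my guess at the datasheet value (not quoted from
it), and the 1 MByte-chunk arithmetic is the simplified allocation model
described above, not the actual ZFS allocator.

# Back-of-the-envelope check of the iops estimates above, assuming
# 11 ms average read seek, ~12 ms average write seek (assumed), 7,200 rpm,
# and 8 drives arranged as 4 mirrored pairs.

RPM = 7200
rot_ms = 0.5 * 60000.0 / RPM              # average rotational delay, ~4.2 ms

def iops_per_disk(seek_ms):
    # one random I/O costs roughly an average seek plus half a revolution
    return 1000.0 / (seek_ms + rot_ms)

read_peak  = 8 * iops_per_disk(11.0)      # reads can hit all 8 spindles
write_peak = 4 * iops_per_disk(12.0)      # writes go to both sides of each mirror

print("peak read  iops ~%.0f, measured 460 -> %.0f%%" % (read_peak, 100 * 460 / read_peak))
print("peak write iops ~%.0f, measured 154 -> %.0f%%" % (write_peak, 100 * 154 / write_peak))

# Why small records need a high write rate to spread across the pairs:
# consecutive records land in the same ~1 MByte allocation chunk (and so on
# the same mirror pair) until that chunk fills.
for rs_kb in (2, 8):
    print("%d kByte recordsize: %d records per 1 MByte allocation chunk"
          % (rs_kb, 1024 // rs_kb))

This prints roughly 527 and 247 peak iops, and shows that a 2 kByte
recordsize packs 512 records into each 1 MByte allocation versus 128 for
8 kByte, which is why the small-record workload needs a much higher write
rate before all four pairs are busy.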