Queuing theory should explain this rather nicely.  iostat measures
%busy by counting whether there is an entry in the queue at each clock
tick.  There are two queues: one in the controller and one on the
disk.  As you can see, the way ZFS pushes the load is very
different from the way dd on the raw device or UFS does.
 -- richard

Marko Milisavljevic wrote:
I am very grateful to everyone who took the time to run a few tests to help me figure out what is going on. As per j's suggestions, I tried some simultaneous reads, and a few other things, and I am getting interesting and confusing results.

All tests are done using two Seagate 320G drives on a sil3114 controller. In each test I am using dd if=.... of=/dev/null bs=128k count=10000. Each drive is freshly formatted with one 2G file copied to it, so that dd from the raw disk and from the file use roughly the same area of the disk. I tried raw, ZFS and UFS, on single drives and on two drives simultaneously (just executing the dd commands in separate terminal windows). These are snapshots of iostat -xnczpm 3 captured somewhere in the middle of each run. I am not reporting CPU%, as it never rose above 50% and was uniformly proportional to the reported throughput.
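
For reference, the procedure boils down to something like this (the raw device path, mount point and file name are assumptions; the devices are the ones shown in the iostat output below):

  # read from the raw device (c0d1 as reported by iostat)
  dd if=/dev/rdsk/c0d1p0 of=/dev/null bs=128k count=10000
  # same read through the filesystem
  dd if=/tank/testfile of=/dev/null bs=128k count=10000
  # in a separate terminal, capture per-device statistics every 3 seconds
  iostat -xnczpm 3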

single drive raw:
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 1378.4    0.0 77190.7    0.0  0.0  1.7    0.0    1.2   0  98 c0d1

single drive, ufs file
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 1255.1    0.0 69949.6    0.0  0.0  1.8    0.0    1.4   0 100 c0d0

Small slowdown, but pretty good.

single drive, zfs file
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  258.3     0.0 33066.6    0.0 33.0  2.0  127.7    7.7 100 100 c0d1

Now that is odd. Why so much waiting? Also, unlike with raw or UFS, kr/s divided by r/s gives 128K, as I would imagine it should, since that is the ZFS recordsize.

simultaneous raw:
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  797.0    0.0 44632.0    0.0  0.0  1.8    0.0    2.3   0 100 c0d0
  795.7    0.0 44557.4    0.0  0.0  1.8    0.0    2.3   0 100 c0d1

Together the two drives total about 89 MB/s (44632 + 44557 kr/s), so this PCI interface seems to be saturated at around 90 MB/s. That is adequate if the goal is to serve files on a gigabit SOHO network.

simultaneous raw on c0d1 and ufs on c0d0:
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  722.4    0.0 40246.8    0.0  0.0   1.8    0.0    2.5   0 100 c0d0
  717.1    0.0 40156.2    0.0  0.0  1.8    0.0    2.5   0  99 c0d1

Hmm, we can no longer get the 90 MB/s.

simultaneous zfs on c0d1 and raw on c0d0:
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.7    0.0    1.8  0.0  0.0    0.0    0.1   0   0 c1d0
  334.9    0.0 18756.0    0.0  0.0  1.9    0.0    5.5   0  97 c0d0
  172.5    0.0 22074.6    0.0 33.0  2.0  191.3   11.6 100 100 c0d1

Everything is slow.

What happens if we throw the onboard IDE interface into the mix?
simultaneous raw SATA and raw PATA:
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 1036.3    0.3 58033.9    0.3  0.0  1.6    0.0    1.6   0  99 c1d0
 1422.6    0.0 79668.3    0.0  0.0  1.6    0.0    1.1   1  98 c0d0

Both at maximum throughput.

Reading a ZFS file on the SATA drive and the raw disk on the PATA interface:
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 1018.9    0.3 57056.1    4.0  0.0  1.7    0.0    1.7   0  99 c1d0
  268.4    0.0 34353.1     0.0 33.0  2.0  122.9    7.5 100 100 c0d0

SATA is slower with ZFS, as expected by now, but PATA remains at full speed. So they are operating quite independently. Except...

What if we read a UFS file from the PATA disk and ZFS from SATA:
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  792.8    0.0 44092.9    0.0  0.0  1.8    0.0    2.2   1  98 c1d0
  224.0    0.0 28675.2    0.0 33.0  2.0  147.3    8.9 100 100 c0d0
Now that is confusing! Why did SATA/ZFS slow down too? I've retried this a number of times; it is not a fluke.

Finally, after reviewing all this, I've noticed another interesting bit: whenever I read from raw disks or UFS files, SATA or PATA, kr/s over r/s is 56k, suggesting that the underlying I/O system is using that as some kind of native block size (even though dd is requesting 128k). But when reading ZFS files, this always comes to 128k, which is expected, since that is the ZFS default (and the same thing happens regardless of bs= in dd). On the theory that my system just doesn't like 128k reads (I'm desperate!), and that this would explain the whole slowdown and the wait/wsvc_t columns, I tried changing recordsize to 32k and rewriting the test file. However, accessing ZFS files continues to show 128k reads, and it is just as slow. Is there a way to either confirm that the ZFS file in question is indeed written with 32k records or, even better, to force ZFS to use 56k when accessing the disk? Or perhaps I just misunderstand the implications of the iostat output.
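
One way to check the recordsize question, sketched against a hypothetical dataset tank/test (keeping in mind that recordsize only applies to blocks written after the property is changed, so the file has to be rewritten afterwards):

  # set the smaller recordsize, then verify it before rewriting the test file
  zfs set recordsize=32k tank/test
  zfs get recordsize tank/test
  # inspect the file's on-disk block size with zdb; the object number is the
  # file's inode number as reported by ls -i
  zdb -dddd tank/test <object-number>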

I've repeated each of these tests a few times and double-checked, and the numbers, although snapshots of a point in time, fairly represent the averages.

I have no idea what to make of all this, except that ZFS has a problem with this hardware/driver combination that UFS and other traditional file systems don't. Is it a bug in the driver that ZFS is inadvertently exposing? A specific feature that ZFS assumes the hardware to have, but it doesn't? Who knows! I will have to give up on Solaris/ZFS on this hardware for now, but I hope to try it again sometime in the future. I'll give FreeBSD/ZFS a spin to see if it fares better (although at this point in its development it is probably riskier than just sticking with Linux and missing out on ZFS).

(Another contributor suggested turning checksumming off - it made no difference. Same for atime. Compression was always off.)
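
For completeness, those properties can be toggled and verified like this (the dataset name is hypothetical):

  zfs set checksum=off tank/test
  zfs set atime=off tank/test
  zfs get checksum,atime,compression tank/test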

On 5/14/07, [EMAIL PROTECTED] wrote:

    Marko,

    I tried this experiment again using 1 disk and got nearly identical
    times:

    # /usr/bin/time dd if=/dev/dsk/c0t0d0 of=/dev/null bs=128k count=10000
    10000+0 records in
    10000+0 records out

    real       21.4
    user        0.0
    sys         2.4

    $ /usr/bin/time dd if=/test/filebench/testfile of=/dev/null bs=128k
    count=10000
    10000+0 records in
    10000+0 records out

    real       21.0
    user         0.0
    sys         0.7
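
    For reference, both runs above work out to roughly the same throughput:

      # 10000 records x 128 KiB = 1250 MiB, read in about 21 seconds
      echo 'scale=1; 10000 * 128 / 1024 / 21.4' | bc    # ~58.4 MiB/s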


     > [I]t is not possible for dd to meaningfully access multiple-disk
     > configurations without going through the file system. I find it
     > curious that there is such a large slowdown by going through file
     > system (with single drive configuration), especially compared to UFS
     > or ext3.

    Comparing a filesystem to raw dd access isn't a completely fair
    comparison either.  Few filesystems actually lay out all of their data
    and metadata so that every read is a completely sequential read.

     > I simply have a small SOHO server and I am trying to evaluate which
     > OS to use to keep a redundant disk array. With unreliable
     > consumer-level hardware, ZFS and the checksum feature are very
     > interesting and the primary selling point compared to a Linux setup,
     > for as long as ZFS can generate enough bandwidth from the drive array
     > to saturate single gigabit ethernet.

    I would take Bart's recommendation and go with Solaris on something
    like a dual-core box with 4 disks.

     > My hardware at the moment is the "wrong" choice for Solaris/ZFS -
     > PCI 3114 SATA controller on a 32-bit AthlonXP, according to many
     > posts I found.

    Bill Moore lists some controller recommendations here:

    http://mail.opensolaris.org/pipermail/zfs-discuss/2006-March/016874.html

     > However, since dd over raw disk is capable of extracting 75+MB/s
     > from this setup, I keep feeling that surely I must be able to get at
     > least that much from reading a pair of striped or mirrored ZFS
     > drives. But I can't - single drive or 2-drive stripes or mirrors, I
     > only get around 34MB/s going through ZFS. (I made sure mirror was
     > rebuilt and I resilvered the stripes.)
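
    (For reference, the two pool layouts being compared could be created
    roughly like this; the pool name and device names are hypothetical:)

      zpool create tank mirror c0d0 c0d1    # 2-drive mirror
      zpool create tank c0d0 c0d1           # 2-drive stripe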

    Maybe this is a problem with your controller?  What happens when you
    have two simultaneous dd's to different disks running?  This would
    simulate the case where you're reading from the two disks at the same
    time.
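
    A minimal sketch of that test (raw device paths are assumptions; the
    "simultaneous raw" iostat snapshot earlier in this post shows the
    result of essentially this):

      dd if=/dev/rdsk/c0d0p0 of=/dev/null bs=128k count=10000 &
      dd if=/dev/rdsk/c0d1p0 of=/dev/null bs=128k count=10000 &
      wait
      # meanwhile, in another terminal: iostat -xnczpm 3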

    -j



------------------------------------------------------------------------

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
