Queuing theory should explain this rather nicely.  iostat measures
%busy by counting whether there is an entry in the queue at each clock
tick.  There are two queues: one in the controller and one on the
disk.  As you can see, the way ZFS pushes the load is very
different from the way dd on the raw device or UFS does.
 -- richard

Marko Milisavljevic wrote:
I am very grateful to everyone who took the time to run a few tests to help me figure out what is going on. As per j's suggestions, I tried some simultaneous reads, and a few other things, and I am getting interesting and confusing results.

All tests are done using two Seagate 320G drives on a sil3114 controller. In each test I am using dd if=.... of=/dev/null bs=128k count=10000. Each drive is freshly formatted with one 2G file copied to it, so that dd from the raw disk and from the file use roughly the same area of the disk. I tried raw, ZFS and UFS, on single drives and on two drives simultaneously (just executing the dd commands in separate terminal windows). These are snapshots of iostat -xnczpm 3 captured somewhere in the middle of each run. I am not reporting CPU%, as it never rose above 50% and was uniformly proportional to the reported throughput.
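
For reference, the procedure boils down to something like this (the raw device path, mount point and file name are assumptions; the devices are the ones shown in the iostat output below):

  # read from the raw device (c0d1 as reported by iostat)
  dd if=/dev/rdsk/c0d1p0 of=/dev/null bs=128k count=10000
  # same read through the filesystem
  dd if=/tank/testfile of=/dev/null bs=128k count=10000
  # in a separate terminal, capture per-device statistics every 3 seconds
  iostat -xnczpm 3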

single drive raw:
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 1378.4    0.0 77190.7    0.0  0.0  1.7    0.0    1.2   0  98 c0d1

single drive, ufs file
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 1255.1    0.0 69949.6    0.0  0.0  1.8    0.0    1.4   0 100 c0d0

Small slowdown, but pretty good.

single drive, zfs file
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  258.3     0.0 33066.6    0.0 33.0  2.0  127.7    7.7 100 100 c0d1

Now that is odd. Why so much waiting? Also, unlike with raw or UFS, kr/s divided by r/s gives 128K, as I would imagine it should, since that is the ZFS recordsize.

simultaneous raw:
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  797.0    0.0 44632.0    0.0  0.0  1.8    0.0    2.3   0 100 c0d0
  795.7    0.0 44557.4    0.0  0.0  1.8    0.0    2.3   0 100 c0d1

Together the two drives total about 89 MB/s (44632 + 44557 kr/s), so this PCI interface seems to be saturated at around 90 MB/s. That is adequate if the goal is to serve files on a gigabit SOHO network.

simultaneous raw on c0d1 and ufs on c0d0:
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  722.4    0.0 40246.8    0.0  0.0   1.8    0.0    2.5   0 100 c0d0
  717.1    0.0 40156.2    0.0  0.0  1.8    0.0    2.5   0  99 c0d1

Hmm, we can no longer get the 90 MB/s.

simultaneous zfs on c0d1 and raw on c0d0:
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.7    0.0    1.8  0.0  0.0    0.0    0.1   0   0 c1d0
  334.9    0.0 18756.0    0.0  0.0  1.9    0.0    5.5   0  97 c0d0
  172.5    0.0 22074.6    0.0 33.0  2.0  191.3   11.6 100 100 c0d1

Everything is slow.

What happens if we throw the onboard IDE interface into the mix?
simultaneous raw SATA and raw PATA:
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 1036.3    0.3 58033.9    0.3  0.0  1.6    0.0    1.6   0  99 c1d0
 1422.6    0.0 79668.3    0.0  0.0  1.6    0.0    1.1   1  98 c0d0

Both at maximum throughput.

Reading a ZFS file on the SATA drive and the raw disk on the PATA interface:
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 1018.9    0.3 57056.1    4.0  0.0  1.7    0.0    1.7   0  99 c1d0
  268.4    0.0 34353.1     0.0 33.0  2.0  122.9    7.5 100 100 c0d0

SATA is slower with ZFS, as expected by now, but PATA remains at full speed. So they are operating quite independently. Except...

What if we read a UFS file from the PATA disk and ZFS from SATA:
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  792.8    0.0 44092.9    0.0  0.0  1.8    0.0    2.2   1  98 c1d0
  224.0    0.0 28675.2    0.0 33.0  2.0  147.3    8.9 100 100 c0d0
Now that is confusing! Why did SATA/ZFS slow down too? I've retried this a number of times; it is not a fluke.

Finally, after reviewing all this, I've noticed another interesting bit: whenever I read from raw disks or UFS files, SATA or PATA, kr/s over r/s is 56k, suggesting that the underlying I/O system is using that as some kind of native block size (even though dd is requesting 128k). But when reading ZFS files, this always comes to 128k, which is expected, since that is the ZFS default (and the same thing happens regardless of bs= in dd). On the theory that my system just doesn't like 128k reads (I'm desperate!), and that this would explain the whole slowdown and the wait/wsvc_t columns, I tried changing recordsize to 32k and rewriting the test file. However, accessing ZFS files continues to show 128k reads, and it is just as slow. Is there a way to either confirm that the ZFS file in question is indeed written with 32k records or, even better, to force ZFS to use 56k when accessing the disk? Or perhaps I just misunderstand the implications of the iostat output.
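
One way to check the recordsize question, sketched against a hypothetical dataset tank/test (keeping in mind that recordsize only applies to blocks written after the property is changed, so the file has to be rewritten afterwards):

  # set the smaller recordsize, then verify it before rewriting the test file
  zfs set recordsize=32k tank/test
  zfs get recordsize tank/test
  # inspect the file's on-disk block size with zdb; the object number is the
  # file's inode number as reported by ls -i
  zdb -dddd tank/test <object-number>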

I've repeated each of these tests a few times and double-checked, and the numbers, although snapshots of a point in time, fairly represent the averages.

I have no idea what to make of all this, except that ZFS has a problem with this hardware/driver combination that UFS and other traditional file systems don't. Is it a bug in the driver that ZFS is inadvertently exposing? A specific feature that ZFS assumes the hardware to have, but it doesn't? Who knows! I will have to give up on Solaris/ZFS on this hardware for now, but I hope to try it again sometime in the future. I'll give FreeBSD/ZFS a spin to see if it fares better (although at this point in its development it is probably riskier than just sticking with Linux and missing out on ZFS).

(Another contributor suggested turning checksumming off - it made no difference. Same for atime. Compression was always off.)
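
For completeness, those properties can be toggled and verified like this (the dataset name is hypothetical):

  zfs set checksum=off tank/test
  zfs set atime=off tank/test
  zfs get checksum,atime,compression tank/test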

On 5/14/07, [EMAIL PROTECTED] wrote:

    Marko,

    I tried this experiment again using 1 disk and got nearly identical
    times:

    # /usr/bin/time dd if=/dev/dsk/c0t0d0 of=/dev/null bs=128k count=10000
    10000+0 records in
    10000+0 records out

    real       21.4
    user        0.0
    sys         2.4

    $ /usr/bin/time dd if=/test/filebench/testfile of=/dev/null bs=128k
    count=10000
    10000+0 records in
    10000+0 records out

    real       21.0
    user         0.0
    sys         0.7
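
    For reference, both runs above work out to roughly the same throughput:

      # 10000 records x 128 KiB = 1250 MiB, read in about 21 seconds
      echo 'scale=1; 10000 * 128 / 1024 / 21.4' | bc    # ~58.4 MiB/s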


     > [I]t is not possible for dd to meaningfully access multiple-disk
     > configurations without going through the file system. I find it
     > curious that there is such a large slowdown by going through file
     > system (with single drive configuration), especially compared to UFS
     > or ext3.

    Comparing a filesystem to raw dd access isn't a completely fair
    comparison either.  Few filesystems actually lay out all of their data
    and metadata so that every read is a completely sequential read.

     > I simply have a small SOHO server and I am trying to evaluate which
     > OS to use to keep a redundant disk array. With unreliable
     > consumer-level hardware, ZFS and the checksum feature are very
     > interesting and the primary selling point compared to a Linux setup,
     > for as long as ZFS can generate enough bandwidth from the drive array
     > to saturate single gigabit ethernet.

    I would take Bart's recommendation and go with Solaris on something
    like a dual-core box with 4 disks.

     > My hardware at the moment is the "wrong" choice for Solaris/ZFS -
     > PCI 3114 SATA controller on a 32-bit AthlonXP, according to many
     > posts I found.

    Bill Moore lists some controller recommendations here:

    http://mail.opensolaris.org/pipermail/zfs-discuss/2006-March/016874.html

     > However, since dd over raw disk is capable of extracting 75+MB/s
     > from this setup, I keep feeling that surely I must be able to get at
     > least that much from reading a pair of striped or mirrored ZFS
     > drives. But I can't - single drive or 2-drive stripes or mirrors, I
     > only get around 34MB/s going through ZFS. (I made sure mirror was
     > rebuilt and I resilvered the stripes.)
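
    (For reference, the two pool layouts being compared could be created
    roughly like this; the pool name and device names are hypothetical:)

      zpool create tank mirror c0d0 c0d1    # 2-drive mirror
      zpool create tank c0d0 c0d1           # 2-drive stripe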

    Maybe this is a problem with your controller?  What happens when you
    have two simultaneous dd's to different disks running?  This would
    simulate the case where you're reading from the two disks at the same
    time.
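
    A minimal sketch of that test (raw device paths are assumptions; the
    "simultaneous raw" iostat snapshot earlier in this post shows the
    result of essentially this):

      dd if=/dev/rdsk/c0d0p0 of=/dev/null bs=128k count=10000 &
      dd if=/dev/rdsk/c0d1p0 of=/dev/null bs=128k count=10000 &
      wait
      # meanwhile, in another terminal: iostat -xnczpm 3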

    -j



------------------------------------------------------------------------

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
