Queuing theory should explain this rather nicely. iostat measures
%busy by counting whether there is an entry in the queue on each clock
tick. There are two queues: one in the controller and one on the
disk. As you can clearly see, the way ZFS pushes the load is very
different from raw dd or UFS.
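As a rough sanity check, Little's law (average number in a queue =
arrival rate x average time spent in that queue) lines up with the
numbers below. Taking the single-drive ZFS snapshot:

wait ~= 258.3 r/s * 127.7 ms / 1000 = 33.0  (matches the wait column)
actv ~= 258.3 r/s *   7.7 ms / 1000 =  2.0  (matches the actv column)

The raw dd runs, by contrast, never queue anything on the host side,
which is why %w stays at 0 even while %b is pegged.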
-- richard
Marko Milisavljevic wrote:
I am very grateful to everyone who took the time to run a few tests to
help me figure what is going on. As per j's suggestions, I tried some
simultaneous reads, and a few other things, and I am getting interesting
and confusing results.
All tests are done using two Seagate 320G drives on sil3114. In each
test I am using dd if=.... of=/dev/null bs=128k count=10000. Each drive
is freshly formatted with one 2G file copied to it. That way, dd from the
raw disk and from the file use roughly the same area of the disk. I tried
raw, zfs and ufs, single drives and two simultaneously (just executing the
dd commands in separate terminal windows). These are snapshots of iostat
-xnczpm 3 captured somewhere in the middle of the operation. I am not
bothering to report CPU% as it never rose over 50% and was uniformly
proportional to the reported throughput.
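For reference, this is roughly how each pair of simultaneous runs was
launched (the raw device paths here are from memory and may differ on
another system):

dd if=/dev/rdsk/c0d0p0 of=/dev/null bs=128k count=10000 &
dd if=/dev/rdsk/c0d1p0 of=/dev/null bs=128k count=10000 &
iostat -xnczpm 3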
single drive raw:
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
1378.4 0.0 77190.7 0.0 0.0 1.7 0.0 1.2 0 98 c0d1
single drive, ufs file
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
1255.1 0.0 69949.6 0.0 0.0 1.8 0.0 1.4 0 100 c0d0
Small slowdown, but pretty good.
single drive, zfs file
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
258.3 0.0 33066.6 0.0 33.0 2.0 127.7 7.7 100 100 c0d1
Now that is odd. Why so much waiting? Also, unlike with raw or UFS, kr/s
divided by r/s gives 128K (33066.6 / 258.3), as I would imagine it should.
simultaneous raw:
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
797.0 0.0 44632.0 0.0 0.0 1.8 0.0 2.3 0 100 c0d0
795.7 0.0 44557.4 0.0 0.0 1.8 0.0 2.3 0 100 c0d1
This PCI interface seems to be saturated at about 90 MB/s. Adequate if the
goal is to serve files on a gigabit SOHO network.
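(Back-of-the-envelope: 44632.0 + 44557.4 kr/s comes to roughly 87-90 MB/s,
which is about what a shared 32-bit/33 MHz PCI bus, 133 MB/s theoretical,
manages in practice, and not far off what a gigabit link can carry after
protocol overhead.)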
simultaneous raw on c0d1 and ufs on c0d0:
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
722.4 0.0 40246.8 0.0 0.0 1.8 0.0 2.5 0 100 c0d0
717.1 0.0 40156.2 0.0 0.0 1.8 0.0 2.5 0 99 c0d1
Hmm, we can no longer get the 90 MB/s.
simultaneous zfs on c0d1 and raw on c0d0:
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
0.0 0.7 0.0 1.8 0.0 0.0 0.0 0.1 0 0 c1d0
334.9 0.0 18756.0 0.0 0.0 1.9 0.0 5.5 0 97 c0d0
172.5 0.0 22074.6 0.0 33.0 2.0 191.3 11.6 100 100 c0d1
Everything is slow.
What happens if we throw the onboard IDE interface into the mix?
simultaneous raw SATA and raw PATA:
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
1036.3 0.3 58033.9 0.3 0.0 1.6 0.0 1.6 0 99 c1d0
1422.6 0.0 79668.3 0.0 0.0 1.6 0.0 1.1 1 98 c0d0
Both at maximum throughput.
Read ZFS on SATA drive and raw disk on PATA interface:
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
1018.9 0.3 57056.1 4.0 0.0 1.7 0.0 1.7 0 99 c1d0
268.4 0.0 34353.1 0.0 33.0 2.0 122.9 7.5 100 100 c0d0
SATA is slower with ZFS, as expected by now, but PATA remains at full
speed. So they are operating quite independently. Except...
What if we read a UFS file from the PATA disk and ZFS from SATA:
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
792.8 0.0 44092.9 0.0 0.0 1.8 0.0 2.2 1 98 c1d0
224.0 0.0 28675.2 0.0 33.0 2.0 147.3 8.9 100 100 c0d0
Now that is confusing! Why did SATA/ZFS slow down too? I've retried this
a number of times; it is not a fluke.
Finally, after reviewing all this, I've noticed another interesting
bit... whenever I read from raw disks or UFS files, SATA or PATA, kr/s
over r/s is 56k, suggesting that the underlying I/O system is using that
as some kind of native block size (even though dd is requesting 128k)?
But when reading ZFS files, this always comes to 128k, which is
expected, since that is the ZFS default recordsize (and the same thing
happens regardless of bs= in dd). On the theory that my system just
doesn't like 128k reads (I'm desperate!), and that this would explain the
whole slowdown and the wait/wsvc_t columns, I tried changing recordsize
to 32k and rewriting the test file. However, accessing ZFS files
continues to show 128k reads, and it is just as slow. Is there a way to
either confirm that the ZFS file in question is indeed written with 32k
records or, even better, to force ZFS to use 56k when accessing the
disk? Or perhaps I just misunderstand the implications of the iostat
output.
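For the record, this is roughly what I did to change the recordsize (the
pool/dataset name here is just a placeholder); as I understand it, the
property only affects blocks written after the change, and zfs get only
confirms the dataset-level setting, not the block size actually used by
an existing file:

zfs get recordsize tank/test                  # check the current dataset property
zfs set recordsize=32k tank/test              # applies only to newly written blocks
cp /var/tmp/testfile /tank/test/testfile.32k  # recreate the test file under the new recordsize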
I've repeated each of these tests a few times and double-checked; the
numbers, although snapshots of a point in time, fairly represent the averages.
I have no idea what to make of all this, except that ZFS has a
problem with this hardware/driver combination that UFS and other
traditional file systems don't. Is it a bug in the driver that ZFS is
inadvertently exposing? A specific feature that ZFS assumes the hardware
to have, but it doesn't? Who knows! I will have to give up on Solaris/ZFS
on this hardware for now, but I hope to try it again sometime in the
future. I'll give FreeBSD/ZFS a spin to see if it fares better (although
at this point in its development it is probably riskier than just
sticking with Linux and missing out on ZFS).
(Another contributor suggested turning checksumming off - it made no
difference. Same for atime. Compression was always off.)
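(For completeness, those property changes were along these lines, dataset
name again a placeholder:

zfs set checksum=off tank/test
zfs set atime=off tank/test
zfs get checksum,atime,compression tank/test

and neither change moved the dd numbers.)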
On 5/14/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
Marko,
I tried this experiment again using 1 disk and got nearly identical
times:
# /usr/bin/time dd if=/dev/dsk/c0t0d0 of=/dev/null bs=128k count=10000
10000+0 records in
10000+0 records out
real 21.4
user 0.0
sys 2.4
$ /usr/bin/time dd if=/test/filebench/testfile of=/dev/null bs=128k
count=10000
10000+0 records in
10000+0 records out
real 21.0
user 0.0
sys 0.7
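For scale, 10000 records x 128k is 1,280,000 KB, so those times work out
to roughly 1280000 / 1024 / 21.4 = 58 MB/s raw and about 60 MB/s through
the filesystem on this box.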
> [I]t is not possible for dd to meaningfully access multiple-disk
> configurations without going through the file system. I find it
> curious that there is such a large slowdown by going through file
> system (with single drive configuration), especially compared to UFS
> or ext3.
Comparing a filesystem to raw dd access isn't a completely fair
comparison either. Few filesystems actually lay out all of their data
and metadata so that every read is a completely sequential read.
> I simply have a small SOHO server and I am trying to evaluate which OS to
> use to keep a redundant disk array. With unreliable consumer-level hardware,
> ZFS and the checksum feature are very interesting and the primary selling
> point compared to a Linux setup, for as long as ZFS can generate enough
> bandwidth from the drive array to saturate single gigabit ethernet.
I would take Bart's recommendation and go with Solaris on something like
a dual-core box with 4 disks.
> My hardware at the moment is the "wrong" choice for Solaris/ZFS - PCI 3114
> SATA controller on a 32-bit AthlonXP, according to many posts I found.
Bill Moore lists some controller recommendations here:
http://mail.opensolaris.org/pipermail/zfs-discuss/2006-March/016874.html
> However, since dd over raw disk is capable of extracting 75+MB/s from this
> setup, I keep feeling that surely I must be able to get at least that much
> from reading a pair of striped or mirrored ZFS drives. But I can't - single
> drive or 2-drive stripes or mirrors, I only get around 34MB/s going through
> ZFS. (I made sure mirror was rebuilt and I resilvered the stripes.)
Maybe this is a problem with your controller? What happens when you
have two simultaneous dd's to different disks running? This would
simulate the case where you're reading from the two disks at the same
time.
-j
------------------------------------------------------------------------
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss