On May 30, 2006, at 2:16 PM, Richard Elling wrote:
> [assuming we're talking about disks and not "hardware RAID arrays"...]
It'd be interesting to know how many customers plan to use raw disks,
and how their performance compares with hardware arrays. (My gut
feeling is that a lot of disks on FC probably isn't too bad, though on
parallel SCSI the negotiation overhead and lack of fairness were awful;
I haven't tested this.)
> On Tue, 2006-05-30 at 11:43 -0500, Anton Rang wrote:
> Sure, the block size may be 128 KB, but ZFS can bundle more than one
> per file/transaction.
But it doesn't right now, as far as I can tell.
> The protocol overhead is still orders of magnitude faster than a rev.
> Sure, there are pathological cases such as FC-AL over 200 km with
> 100+ nodes, but most folks won't hurt themselves like that.
OK. Let's take 4 Gb FC (e.g., array hardware). Sending 128 KB will take
roughly 330 microseconds. If we're going to achieve 95% of the
theoretical rate, then each transaction can have no more than 5% of
that for overhead, or about 16 microseconds. That's pretty darn fast.
For that matter, the Solaris host would have to initiate 3,000 writes
per second to keep the channel busy. For each channel. And a host
might well have 20 channels.
Can our FC stack do that? Not yet, though it's been looked at....
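Back-of-envelope, as a rough Python sketch (the ~400 MB/s of usable
payload on 4 Gb FC after encoding and framing is my assumption, not a
measured number):

  payload_rate = 400e6            # bytes/sec usable on 4 Gb FC (assumed)
  xfer = 128 * 1024               # one 128 KB write

  wire_time = xfer / payload_rate         # ~330 microseconds on the wire
  overhead_budget = 0.05 * wire_time      # ~16 us if we want 95% efficiency
  cmds_per_sec = payload_rate / xfer      # ~3,050 writes/sec to stay busy

  print("wire time        %6.0f us" % (wire_time * 1e6))
  print("overhead budget  %6.1f us" % (overhead_budget * 1e6))
  print("writes/sec       %6.0f" % cmds_per_sec)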
At 16 MB [why 16? because we can't do 32 MB in a WRITE(10) command] we
have some more leeway. Sending 16 MB will take roughly 42 ms. Each
transaction can take 5% of that, or 2 ms, for overhead, and still reach
the 95% mark. And we only need to issue 24 commands per second to keep
the channel saturated. No problem....
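Same sketch for the 16 MB case, plus the WRITE(10) ceiling (its 16-bit
transfer length in 512-byte blocks tops out just short of 32 MB), still
assuming ~400 MB/s of usable payload:

  payload_rate = 400e6
  xfer = 16 * 1024 * 1024

  wire_time = xfer / payload_rate         # ~42 ms on the wire
  overhead_budget = 0.05 * wire_time      # ~2 ms allowed per command
  cmds_per_sec = payload_rate / xfer      # ~24 commands/sec keeps it full

  write10_max = 65535 * 512               # largest WRITE(10) transfer
  print("wire time        %5.1f ms" % (wire_time * 1e3))
  print("overhead budget  %5.1f ms" % (overhead_budget * 1e3))
  print("commands/sec     %5.1f" % cmds_per_sec)
  print("WRITE(10) max    %d bytes (just under 32 MB)" % write10_max)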
Single disks still run FC at 2 Gb, so the rates above are roughly
halved (and the per-command times doubled), and since it takes 2-4
disks to max out a channel, you can also multiply the allowable
overhead time on the disk by a factor of 2-4. That gives the disk
about 16 * 2 * 4 = 128 microseconds to process a command. The disk
might be able to do that. Solaris (and the HBA) still need to push out
1,500 writes per second (per channel), though. A good HBA may be able
to do that....
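The per-disk arithmetic, assuming ~200 MB/s of usable payload on 2 Gb
FC and taking the high end of 4 disks per channel:

  payload_rate = 200e6            # bytes/sec usable on 2 Gb FC (assumed)
  xfer = 128 * 1024
  disks_per_channel = 4           # 2-4 disks to saturate; use 4 here

  wire_time = xfer / payload_rate               # ~655 us per transfer
  channel_budget = 0.05 * wire_time             # ~33 us overhead per command
  per_disk_budget = channel_budget * disks_per_channel   # ~130 us per disk
  cmds_per_sec = payload_rate / xfer            # ~1,500 writes/sec/channel

  print("per-disk budget   %4.0f us" % (per_disk_budget * 1e6))
  print("writes/sec/chan   %4.0f" % cmds_per_sec)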
> For modern disks, multiple 128 KB transfers will spend a long time
> in the disk's buffer cache waiting to be written to media.
They shouldn't spend that long, really. Today's Cheetah has a
200 MB/sec interface and a 59-118 MB/sec transfer rate to media, so at
best we can fill the cache a little over twice as fast as it empties.
(Once we put multiple disks on the channel, it's easy to have the cache
empty faster than we fill it -- this is actually the desirable case,
so that we're not waiting on the media.)
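Roughly, using the interface and media rates above (which zone you land
in decides where you fall in the range):

  interface = 200e6                     # bytes/sec across FC to the drive
  media_slow, media_fast = 59e6, 118e6  # bytes/sec to the platters, by zone

  xfer = 128 * 1024
  drain_slow = xfer / media_slow * 1e3  # ~2.2 ms to flush one block
  drain_fast = xfer / media_fast * 1e3  # ~1.1 ms on the fast zones

  # The interface outruns the media by ~1.7x on the outer zones and
  # ~3.4x on the inner ones -- call it a bit over 2x in the middle.
  print("drain per 128 KB: %.1f-%.1f ms" % (drain_fast, drain_slow))
  print("fill/drain ratio: %.1fx-%.1fx" % (interface / media_fast,
                                           interface / media_slow))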
> Very few disks have 16 MB write buffer caches, so if you want to send
> such a large iop down the wire (DAS please, otherwise you kill the
> SAN), then you'll be waiting on the media anyway. The disk
> interconnect is faster than the media speed. I don't see how you
> could avoid blowing a rev in that case.
Yes, we'll wait on the media. We'll never lose a rev, though. Each
track on a Cheetah holds an average of 400 KB (1.6 MB per cylinder), so
each time we change tracks, we'll likely have the buffer full with all
the data for the track. Even if we don't, FC transfers data out of
order, so the drive can re-order if it deems necessary (in the
desirable cache-empty case).
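For scale, here's what a 16 MB write looks like against those
track/cylinder figures (same sketch style; media rates as above):

  xfer = 16 * 1024 * 1024
  track_bytes = 400e3                   # ~400 KB per track (average)
  cyl_bytes = 1.6e6                     # ~1.6 MB per cylinder
  media_slow, media_fast = 59e6, 118e6

  tracks = xfer / track_bytes           # ~42 tracks touched
  cylinders = xfer / cyl_bytes          # ~10 cylinders
  t_fast = xfer / media_fast * 1e3      # ~140 ms on the fastest zones
  t_slow = xfer / media_slow * 1e3      # ~280 ms on the slowest

  # With the link running faster than the media, the buffer stays ahead
  # of the heads for the whole transfer, so no revolution is lost.
  print("tracks: %.0f  cylinders: %.0f" % (tracks, cylinders))
  print("time on media: %.0f-%.0f ms" % (t_fast, t_slow))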
But until we have a well-configured test system to benchmark, this is
rather academic. :-) I suspect our customers will quickly tell us how
well ZFS works in their environments. Hopefully the answer will be
"very well" for the 95% of customers whose requirements fall in the
middle of the distribution; for those on the "radical fringe" of I/O
requirements, there will likely be more work to do.
I'll wander off to wait for some real data. ;-)
-- Anton
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss