Hello Anton,

Tuesday, May 30, 2006, 9:59:09 PM, you wrote:

AR> On May 30, 2006, at 2:16 PM, Richard Elling wrote:

>> [assuming we're talking about disks and not "hardware RAID arrays"...]

AR> It'd be interesting to know how many customers plan to use raw disks,
AR> and how their performance relates to hardware arrays.  (My gut feeling
AR> is that a lot of disks on FC probably isn't too bad, though on parallel
AR> SCSI the negotiation overhead and lack of fairness was awful, but I
AR> haven't tested this.)

>> On Tue, 2006-05-30 at 11:43 -0500, Anton Rang wrote:
>>>> Sure, the block size may be 128KB, but ZFS can bundle more than one
>>>> per-file/transaction
>>>
>>> But it doesn't right now, as far as I can tell.
>>
>> The protocol overhead is still orders of magnitude faster than a
>> rev.  Sure, there are pathological cases such as FC-AL over
>> 200kms with 100+ nodes, but most folks won't hurt themselves like
>> that.

AR> OK.  Let's take 4 Gb FC (e.g. array hardware).  Sending 128 KB will take
AR> roughly 330 microseconds.  If we're going to achieve 95% of theoretical
AR> rate, then each transaction can have no more than 5% of that for overhead,
AR> or 16 microseconds.  That's pretty darn fast.  For that matter, the
AR> Solaris host would have to initiate 3,000 writes per second to keep the
AR> channel busy.  For each channel.  And a host might well have 20 channels.
AR> Can our FC stack do that?  Not yet, though it's been looked at....
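
Just to make the arithmetic explicit, here is a rough back-of-the-envelope
sketch in Python.  The ~400 MB/s usable bandwidth for 4 Gb FC and the 95%
target are simply the assumptions from your paragraph, not measurements:

    # Budget for 128 KB writes on a 4 Gb FC link (~400 MB/s usable, assumed).
    payload = 128 * 1024                  # bytes per write
    link    = 400e6                       # usable bytes/sec (approximation)
    target  = 0.95                        # desired fraction of theoretical rate

    xfer_time = payload / link            # ~330 microseconds on the wire
    overhead  = (1 - target) * xfer_time  # ~16 microseconds allowed per command
    iops      = link / payload            # ~3000 writes/s to keep the link busy
    print("transfer %.0f us, overhead budget %.0f us, %.0f writes/s"
          % (xfer_time * 1e6, overhead * 1e6, iops))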

AR> At 16 MB [why 16? because we can't do 32 MB in a WRITE(10) command] we
AR> have some more leeway.  Sending 16 MB will take roughly 42 ms.  Each
AR> transaction can take 5% of that, or 2 ms, for overhead, and still reach
AR> the 95% mark.  And we only need to issue 24 commands per second to keep
AR> the channel saturated.  No problem....
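
The same formula for the 16 MB case (again assuming ~400 MB/s usable):

    # 16 MB writes on the same assumed ~400 MB/s 4 Gb FC link.
    payload = 16 * 2**20                  # bytes per write
    link    = 400e6                       # usable bytes/sec (approximation)
    xfer_time = payload / link            # ~42 ms on the wire
    overhead  = 0.05 * xfer_time          # ~2.1 ms allowed per command
    iops      = link / payload            # ~24 commands/s to stay saturated
    print("%.0f ms, %.1f ms budget, %.0f commands/s"
          % (xfer_time * 1e3, overhead * 1e3, iops))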

AR> Single disks still run FC at 2 Gb, so the numbers above are roughly
AR> halved, and since it takes 2-4 disks to max out a channel, you can
AR> also multiply the allowable overhead time on the disk by a factor of
AR> 2-4.  That gives the disk about 16*2*4 = 128 microseconds to process
AR> a command.  The disk might be able to do that.  Solaris (and the HBA)
AR> still need to push out 1500 writes per second (per channel), though.
AR> A good HBA may be able to do that....
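
And the 2 Gb single-disk case, with the channel shared by 2-4 disks (the
~200 MB/s figure and the 2-4 disk assumption are just taken from above):

    # 128 KB writes to one disk on a 2 Gb FC loop (~200 MB/s usable, assumed).
    payload = 128 * 1024
    link    = 200e6                          # usable bytes/sec (approximation)
    xfer    = payload / link                 # ~660 us per 128 KB on the wire
    budget  = 0.05 * xfer                    # ~33 us overhead per command
    per_disk = [budget * n for n in (2, 4)]  # ~65-130 us each disk may spend
    iops    = link / payload                 # ~1500 writes/s per channel
    print("%.0f us on wire, %.0f-%.0f us per disk, %.0f writes/s"
          % (xfer * 1e6, per_disk[0] * 1e6, per_disk[1] * 1e6, iops))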

>> For modern disks, multiple 128kByte transfers will spend a long time
>> in the disk's buffer cache waiting to be written to media.

AR> They shouldn't spend that long, really.  Today's Cheetah has a 200 MB/sec
AR> interface, and a 59-118 MB/sec transfer rate to media, so at best we can
AR> fill the cache a little over twice as fast as it empties.  (Once we put
AR> multiple disks on the channel, it's easy to have the cache empty faster
AR> than we fill it -- this is actually the desirable case, so that we're
AR> not waiting on the media.)
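
A quick sketch of that fill-vs-drain ratio, using only the figures quoted
above (200 MB/s interface, 59-118 MB/s to media):

    # Ratio of interface speed to media speed for the quoted Cheetah numbers.
    interface = 200.0                    # MB/s into the drive cache
    media_min, media_max = 59.0, 118.0   # MB/s out to the platters
    print("cache fills %.1fx-%.1fx faster than it drains"
          % (interface / media_max, interface / media_min))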

>> Very few disks have 16MByte write buffer caches, so if you want to send
>> such a large iop down the wire (DAS please, otherwise you kill the SAN),
>> then you'll be waiting on the media anyway.  The disk interconnect is
>> faster than the media speed.  I don't see how you could avoid blowing
>> a rev in that case.

AR> Yes, we'll wait on the media.  We'll never lose a rev, though.  Each
AR> track on a Cheetah holds an average of 400 KB (1.6 MB/cylinder), so each
AR> time that we change tracks, we'll likely have the buffer full with all
AR> the data for the track.  Even if we don't, FC transfers data out of order,
AR> so the drive can re-order if it deems necessary (in the desirable
AR> cache-empty case).
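
For what it's worth, a little sanity check on the no-lost-rev argument,
using only the numbers quoted above (400 KB/track, 59-118 MB/s media,
200 MB/s interface):

    # Does the next track's data arrive before the current track is written?
    track     = 400e3                  # bytes per track (average, as quoted)
    interface = 200e6                  # bytes/sec into the cache
    for media in (59e6, 118e6):        # bytes/sec to the platters
        write_ms  = track / media * 1e3       # ~3.4-6.8 ms to write a track
        refill_ms = track / interface * 1e3   # ~2.0 ms to receive a track's data
        print("media %3.0f MB/s: write %.1f ms, refill %.1f ms"
              % (media / 1e6, write_ms, refill_ms))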

AR> But until we have a well-configured test system to benchmark, this is
AR> rather academic.  :-)  I suspect our customers will quickly tell us how
AR> well ZFS works in their environments.  Hopefully the answer will be
AR> "very well" for the 95% of customers who are in the median; for those
AR> on the "radical fringe" of I/O requirements, there will likely be more
AR> work to do.

AR> I'll wander off to wait for some real data.  ;-)

Well, see my earlier posts here about some basic testing of sequential
writes with dd using different block sizes. It looks like using an 8MB IO
size gives much better real throughput (to UFS or a raw disk) than ZFS
achieves when it actually writes with 128KB IOs. And the difference is
quite big.
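
Roughly the shape of that test, with placeholder device/file names rather
than the exact commands from my earlier posts:

    # Time sequential writes with dd at 8 MB vs 128 KB block sizes.
    # The target paths are placeholders; writing to a raw disk is destructive.
    import subprocess, time

    targets = {
        "raw disk": "/dev/rdsk/cXtYdZs0",   # placeholder raw FC disk
        "zfs file": "/pool/testfile",       # placeholder file on a ZFS pool
    }
    runs = {"8192k": 128, "128k": 8192}     # ~1 GB written either way

    for name, path in targets.items():
        for bs, count in runs.items():
            t0 = time.time()
            subprocess.call(["dd", "if=/dev/zero", "of=%s" % path,
                             "bs=%s" % bs, "count=%d" % count])
            secs = time.time() - t0
            print("%s bs=%s: %.1f MB/s" % (name, bs, 1024 / secs))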

Roch noticed that it could be related to another issue with ZFS which is
being addressed - however, I still feel that for sequential writes, large
IOs can give much better throughput than just using 128KB.

PS. I was using FC disks directly connected to the host (JBOD), without
HW RAID.

-- 
Best regards,
 Robert                            mailto:[EMAIL PROTECTED]
                                       http://milek.blogspot.com

