On Tue, 2006-05-30 at 14:59 -0500, Anton Rang wrote:
> On May 30, 2006, at 2:16 PM, Richard Elling wrote:
>
> > [assuming we're talking about disks and not "hardware RAID arrays"...]
>
> It'd be interesting to know how many customers plan to use raw disks,
> and how their performance relates to hardware arrays.  (My gut feeling
> is that a lot of disks on FC probably isn't too bad, though on parallel
> SCSI the negotiation overhead and lack of fairness was awful, but I
> haven't tested this.)
FC-AL has much greater arbitration overhead than parallel SCSI, though
FC-AL is arguably more fair for targets.  However, parallel SCSI is at
its end; SAS/SATA is taking over, relegating FC to the non-IP SAN.

> > On Tue, 2006-05-30 at 11:43 -0500, Anton Rang wrote:
> >>> Sure, the block size may be 128KB, but ZFS can bundle more than one
> >>> per-file/transaction
> >>
> >> But it doesn't right now, as far as I can tell.
> >
> > The protocol overhead is still orders of magnitude faster than a
> > rev.  Sure, there are pathological cases such as FC-AL over
> > 200kms with 100+ nodes, but most folks won't hurt themselves like
> > that.
>
> OK.  Let's take 4 Gb FC (e.g. array hardware).  Sending 128 KB will take
> roughly 330 microseconds.  If we're going to achieve 95% of theoretical
> rate, then each transaction can have no more than 5% of that for
> overhead, or 16 microseconds.  That's pretty darn fast.  For that
> matter, the Solaris host would have to initiate 3,000 writes per second
> to keep the channel busy.  For each channel.  And a host might well
> have 20 channels.  Can our FC stack do that?  Not yet, though it's been
> looked at....
>
> At 16 MB [why 16? because we can't do 32 MB in a WRITE(10) command] we
> have some more leeway.  Sending 16 MB will take roughly 42 ms.  Each
> transaction can take 5% of that, or 2 ms, for overhead, and still reach
> the 95% mark.  And we only need to issue 24 commands per second to keep
> the channel saturated.  No problem....
>
> Single disks still run FC at 2 Gb, so the numbers above are roughly
> halved, and since it takes 2-4 disks to max out a channel, you can
> also multiply the allowable overhead time on the disk by a factor of
> 2-4.  That gives the disk about 16*2*4 = 128 microseconds to process
> a command.  The disk might be able to do that.  Solaris (and the HBA)
> still need to push out 1500 writes per second (per channel), though.
> A good HBA may be able to do that....
>
> > For modern disks, multiple 128kByte transfers will spend a long time
> > in the disk's buffer cache waiting to be written to media.
>
> They shouldn't spend that long, really.  Today's Cheetah has a
> 200 MB/sec interface, and a 59-118 MB/sec transfer rate to media, so at
> best we can fill the cache a little over twice as fast as it empties.
> (Once we put multiple disks on the channel, it's easy to have the cache
> empty faster than we fill it -- this is actually the desirable case, so
> that we're not waiting on the media.)
>
> > Very few disks have 16MByte write buffer caches, so if you want to
> > send such a large iop down the wire (DAS please, otherwise you kill
> > the SAN), then you'll be waiting on the media anyway.  The disk
> > interconnect is faster than the media speed.  I don't see how you
> > could avoid blowing a rev in that case.
>
> Yes, we'll wait on the media.  We'll never lose a rev, though.  Each
> track on a Cheetah holds an average of 400 KB (1.6 MB/cylinder), so each
> time that we change tracks, we'll likely have the buffer full with all
> the data for the track.

Right.  For an ST3300007LW (300 GByte Ultra320 SCSI), a 16 MByte iop
takes approximately 50 ms to transfer over the SCSI bus.  The media
speed is 59-118 MBytes/s, so draining those 16 MBytes to the platters
takes roughly 136-270 ms.  The default cache size is 8 MBytes.  Since
the transfer is too big to fit in the cache, you either have to stall
the transfer or you overrun the buffer.  For reads, the disk won't be
able to keep the bus busy either.  Using such big block sizes doesn't
gain you anything.
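For anyone who wants to play with the arithmetic, here's a quick
back-of-the-envelope Python sketch of the numbers in this thread.  The
rates are my assumptions, not measurements: roughly 400 MBytes/s of
payload on 4 Gb FC after 8b/10b encoding, 200 MBytes/s on 2 Gb FC,
320 MBytes/s on an Ultra320 bus, and the 59-118 MBytes/s Cheetah media
rate quoted above.

#!/usr/bin/env python
# Back-of-the-envelope numbers for the discussion above.  The rates are
# assumed payload bandwidths, not measurements of any real channel.

KB = 1 << 10
MB = 1 << 20

def wire_time(size_bytes, rate_bytes_per_s):
    """Seconds to move one transfer of size_bytes at rate_bytes_per_s."""
    return float(size_bytes) / rate_bytes_per_s

def budget(size_bytes, rate_bytes_per_s, efficiency=0.95):
    """Per-command overhead allowed while still reaching `efficiency` of
    the channel's theoretical rate, and the command rate needed to keep
    the channel saturated."""
    t = wire_time(size_bytes, rate_bytes_per_s)
    return (1.0 - efficiency) * t, 1.0 / t

cases = [("128 KB, 4 Gb FC  ", 128 * KB, 400e6),
         (" 16 MB, 4 Gb FC  ",  16 * MB, 400e6),
         ("128 KB, 2 Gb FC  ", 128 * KB, 200e6),
         (" 16 MB, Ultra320 ",  16 * MB, 320e6)]

for label, size, rate in cases:
    overhead, iops = budget(size, rate)
    print("%s wire %8.3f ms  overhead budget %7.1f us  %5.0f cmds/s"
          % (label, wire_time(size, rate) * 1e3, overhead * 1e6, iops))

# How long the media needs to drain a 16 MB write that can't fit in the
# ST3300007LW's 8 MB cache -- the reason the transfer has to stall.
for media_rate in (59e6, 118e6):
    print("16 MB at %3.0f MB/s media rate: %4.0f ms"
          % (media_rate / 1e6, wire_time(16 * MB, media_rate) * 1e3))

Running it gives roughly 0.33 ms / 16 us / 3,000 cmds/s for 128 KB on
4 Gb FC, 42 ms / 2.1 ms / 24 cmds/s for 16 MB, about 1500 cmds/s for
128 KB on 2 Gb FC, about 52 ms on the Ultra320 bus, and a 142-284 ms
media drain (the 136-270 ms above uses decimal megabytes) -- the same
ballpark as the figures quoted in the thread.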
I suspect that the full size of the buffer is not available since they
might want to use some of the space for the read cache, too.

> Even if we don't, FC transfers data out of order,
> so the drive can re-order if it deems necessary (in the desirable
> cache-empty case).

A single iop (even a 16 MByte one) will not be re-ordered.

> But until we have a well-configured test system to benchmark, this is
> rather academic. :-)  I suspect our customers will quickly tell us how
> well ZFS works in their environments.  Hopefully the answer will be
> "very well" for the 95% of customers who are in the median; for those
> on the "radical fringe" of I/O requirements, there will likely be more
> work to do.

Agree 100% :-)
 -- richard

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss