Hello Anton,

Tuesday, May 30, 2006, 9:59:09 PM, you wrote:
AR> On May 30, 2006, at 2:16 PM, Richard Elling wrote:

>> [assuming we're talking about disks and not "hardware RAID arrays"...]

AR> It'd be interesting to know how many customers plan to use raw disks,
AR> and how their performance relates to hardware arrays. (My gut feeling
AR> is that a lot of disks on FC probably isn't too bad, though on parallel
AR> SCSI the negotiation overhead and lack of fairness was awful, but I
AR> haven't tested this.)

>> On Tue, 2006-05-30 at 11:43 -0500, Anton Rang wrote:
>>>> Sure, the block size may be 128KB, but ZFS can bundle more than one
>>>> per-file/transaction
>>>
>>> But it doesn't right now, as far as I can tell.
>>
>> The protocol overhead is still orders of magnitude faster than a
>> rev. Sure, there are pathological cases such as FC-AL over
>> 200kms with 100+ nodes, but most folks won't hurt themselves like
>> that.

AR> OK. Let's take 4 Gb FC (e.g. array hardware). Sending 128 KB will take
AR> roughly 330 microseconds. If we're going to achieve 95% of theoretical
AR> rate, then each transaction can have no more than 5% of that for
AR> overhead, or 16 microseconds. That's pretty darn fast. For that matter,
AR> the Solaris host would have to initiate 3,000 writes per second to keep
AR> the channel busy. For each channel. And a host might well have 20
AR> channels.

AR> Can our FC stack do that? Not yet, though it's been looked at....

AR> At 16 MB [why 16? because we can't do 32 MB in a WRITE(10) command] we
AR> have some more leeway. Sending 16 MB will take roughly 42 ms. Each
AR> transaction can take 5% of that, or 2 ms, for overhead, and still reach
AR> the 95% mark. And we only need to issue 24 commands per second to keep
AR> the channel saturated. No problem....

AR> Single disks still run FC at 2 Gb, so the numbers above are roughly
AR> halved, and since it takes 2-4 disks to max out a channel, you can
AR> also multiply the allowable overhead time on the disk by a factor of
AR> 2-4. That gives the disk about 16*2*4 = 128 microseconds to process
AR> a command. The disk might be able to do that. Solaris (and the HBA)
AR> still need to push out 1500 writes per second (per channel), though.
AR> A good HBA may be able to do that....

>> For modern disks, multiple 128kByte transfers will spend a long time
>> in the disk's buffer cache waiting to be written to media.

AR> They shouldn't spend that long, really. Today's Cheetah has a 200 MB/sec
AR> interface, and a 59-118 MB/sec transfer rate to media, so at best we can
AR> fill the cache a little over twice as fast as it empties. (Once we put
AR> multiple disks on the channel, it's easy to have the cache empty faster
AR> than we fill it -- this is actually the desirable case, so that we're
AR> not waiting on the media.)

>> Very few disks have 16MByte write buffer caches, so if you want to send
>> such a large iop down the wire (DAS please, otherwise you kill the SAN),
>> then you'll be waiting on the media anyway. The disk interconnect is
>> faster than the media speed. I don't see how you could avoid blowing
>> a rev in that case.

AR> Yes, we'll wait on the media. We'll never lose a rev, though. Each
AR> track on a Cheetah holds an average of 400 KB (1.6 MB/cylinder), so each
AR> time that we change tracks, we'll likely have the buffer full with all
AR> the data for the track. Even if we don't, FC transfers data out of order,
AR> so the drive can re-order if it deems necessary (in the desirable
AR> cache-empty case).
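[A quick back-of-the-envelope check of the numbers above. This assumes
roughly 400 MB/s of usable payload bandwidth on 4 Gb FC after encoding
overhead, which is only an approximation, but the arithmetic comes out
about where Anton put it:]

awk 'BEGIN {
    bw = 400 * 1000000               # assumed usable 4 Gb FC payload rate, bytes/sec
    t128 = 128 * 1024 / bw           # time on the wire for one 128 KB write
    t16m = 16 * 1024 * 1024 / bw     # time on the wire for one 16 MB write
    printf("128 KB transfer:          %.0f us\n", t128 * 1000000)
    printf("5%% overhead budget:       %.0f us\n", t128 * 1000000 * 0.05)
    printf("writes/sec to saturate:   %.0f\n", bw / (128 * 1024))
    printf("16 MB transfer:           %.1f ms\n", t16m * 1000)
    printf("commands/sec to saturate: %.0f\n", bw / (16 * 1024 * 1024))
}'

That prints roughly 330 us / 16 us / 3000 writes per second for the 128 KB
case, and roughly 42 ms / 24 commands per second for the 16 MB case, which
matches the figures quoted above.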
AR> But until we have a well-configured test system to benchmark, this is
AR> rather academic. :-) I suspect our customers will quickly tell us how
AR> well ZFS works in their environments. Hopefully the answer will be
AR> "very well" for the 95% of customers who are in the median; for those
AR> on the "radical fringe" of I/O requirements, there will likely be more
AR> work to do.

AR> I'll wander off to wait for some real data. ;-)

Well, see my earlier posts here about some basic testing of sequential
writes with dd using different block sizes. It looks like an 8 MB I/O size
gives much better real throughput (to UFS or to the raw disk) than ZFS
gets when it actually writes the data as 128 KB I/Os -- and the difference
is quite big. Roch noticed that this could be related to another ZFS issue
which is being addressed; however, I still feel that for sequential
writing, large I/Os can give much better throughput than 128 KB ones. A
rough sketch of the kind of dd runs I mean is appended at the end of this
mail.

ps. I was using FC disks directly connected to the host (JBOD), without
HW RAID.

--
Best regards,
 Robert                          mailto:[EMAIL PROTECTED]
                                 http://milek.blogspot.com
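[For anyone who wants to repeat the comparison, below is roughly the kind
of dd run I mean. The device path, ZFS mount point and counts are just
placeholders for illustration, not my actual test configuration:]

# 8 MB sequential writes straight to the raw FC disk
dd if=/dev/zero of=/dev/rdsk/cXtYdZs0 bs=8192k count=1000

# the same amount of data to a file on a ZFS file system
dd if=/dev/zero of=/tank/test/file1 bs=8192k count=1000

# and again with 128 KB blocks for comparison
dd if=/dev/zero of=/tank/test/file2 bs=128k count=64000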