> I am not sure how zfs would know the rate of the
> underlying disk storage 

Easy:  Is the buffer growing?  :-)

If the amount of data in the buffer is growing, you need to throttle back a bit 
until the disks catch up.  Don't stop writes altogether and wait for the buffer 
to empty; just slow them down to match the rate at which you're clearing data 
from the buffer.
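
Something like this, as a toy sketch (Python, with completely made-up names and 
a naive sampling scheme -- it's just to show the feedback idea, not how ZFS 
would actually do it):

    from collections import deque

    class WriteThrottle:
        """Admit writes at roughly the drain rate, instead of stop/start."""

        def __init__(self, window=5):
            # Recent buffer-occupancy samples, oldest first.
            self.samples = deque(maxlen=window)

        def delay_for(self, buffered_bytes, drain_bytes_per_s, write_bytes):
            self.samples.append(buffered_bytes)
            full = len(self.samples) == self.samples.maxlen
            growing = full and self.samples[-1] > self.samples[0]
            if not growing or drain_bytes_per_s <= 0:
                return 0.0  # buffer steady or shrinking: no throttle needed
            # Pace this write to take as long as the disks need to retire
            # the same number of bytes, so the buffer stops growing.
            return write_bytes / drain_bytes_per_s

Before accepting each client write you'd sleep for delay_for(...) seconds; 
once the disks catch up, the delay drops back to zero.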

In your case I'd expect to see ZFS buffer the early part of the write (so you'd 
see a very quick initial burst), but from then on you'd want a continuous 
stream of data to disk at a steady rate.

To the client it should behave just like writing straight to disk; the only 
difference is a small delay before the data actually hits the disk, 
proportional to the buffer size.  ZFS won't have as much opportunity to 
optimize writes, but you wouldn't get such stuttering performance.

However, reading through the other messages, if this is a known bug with ZFS 
blocking reads while writing, there may be no need for this idea.  But then, 
that bug has been open since 2006, is flagged as fix-in-progress, and was 
planned for snv_51.... o_0.  So it probably is worth having this discussion.

And I may be completely wrong here, but reading that bug, it sounds like ZFS 
issues a whole bunch of writes at once as it clears the buffer, which ties in 
with the reports of stalls actually being caused by blocked reads.

I'm guessing that given ZFS's aims it made sense to code it that way - if 
you're going to queue up a bunch of transactions to make them efficient on 
disk, you don't want to interrupt that batch with a bunch of other (less 
efficient) reads.

But the unintended side effect is that ZFS's attempt to optimize writes will 
cause jerky read and write behaviour any time you have a large amount of 
writes going on.  And when you should be pushing the disks to 100% usage, 
you'll never get there, because the cycle is always 5s of write inactivity 
followed by 5s of running the disks flat out.

In fact, I wonder if it's as simple as the disks ending up doing 5s of reads, 
a delay for processing, 5s of writes, 5s of reads, etc...

It's probably efficient, but it's going to *feel* horrible: a 5s delay is 
easily noticeable by the end user, and it's a deal breaker for many 
applications.

In situations like that, 5s is a *huge* amount of time, especially if you're 
writing to a disk or storage device which has its own caching!  Might it be 
possible to keep the 5s buffer for ordering transactions, but then commit it 
as a larger number of small transactions instead of one huge one?
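
As a rough sketch of what I mean (hypothetical names again, not the actual 
ZFS transaction code):

    def commit_in_chunks(txg_data, chunk_bytes, write_chunk, serve_reads):
        # Flush one large transaction group as a series of smaller
        # commits, yielding to any queued reads between each one.
        offset = 0
        while offset < len(txg_data):
            write_chunk(txg_data[offset:offset + chunk_bytes])
            serve_reads()  # let pending reads through between commits
            offset += chunk_bytes

The ordering work could still be done over the full 5s buffer; only the final 
commit gets sliced up.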

The number of transactions could even be based on how busy the system is - if 
there are a lot of reads coming in, I'd be quite happy to split that into 50 
transactions.  On 10GbE, 5s is potentially 6.25GB of data.  Even split into 50 
transactions you're still writing ~125MB at a time, and that sounds plenty big 
enough to me!
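
To put numbers on that, here's the arithmetic plus a made-up sizing policy 
(purely illustrative -- the read-pressure scaling is my own invention):

    def txg_chunks(buffered_bytes, reads_per_s, max_chunks=50):
        # More read pressure -> more, smaller commits (capped).
        chunks = max(1, min(max_chunks, reads_per_s // 100))
        return chunks, buffered_bytes // chunks

    buffered = int(5 * 1.25e9)  # 5s at 10GbE line rate = ~6.25GB
    n, size = txg_chunks(buffered, reads_per_s=10_000)
    print(n, size // 2**20)     # -> 50 chunks of ~119 MiB each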

Either way, something needs to be done.  If we move to ZFS, our users are not 
going to be impressed with 5s delays on the storage system.

Finally, I do have one question for the ZFS guys:  How does the L2ARC interact 
with this?  Are reads from the L2ARC blocked, or will they happen in parallel 
with the writes to the main storage?  I suspect that a large L2ARC (potentially 
made up of SSD disks) would eliminate this problem the majority of the time.