Hello Mark,

Tuesday, April 15, 2008, 8:32:32 PM, you wrote:

MM> ZFS has always done a certain amount of "write throttling".  In the past
MM> (or the present, for those of you running S10 or pre build 87 bits) this
MM> throttling was controlled by a timer and the size of the ARC: we would
MM> "cut" a transaction group every 5 seconds based off of our timer, and
MM> we would also "cut" a transaction group if we had more than 1/4 of the
MM> ARC size worth of dirty data in the transaction group.  So, for example,
MM> if you have a machine with 16GB of physical memory it wouldn't be
MM> unusual to see an ARC size of around 12GB.  This means we would allow
MM> up to 3GB of dirty data into a single transaction group (if the writes
MM> complete in less than 5 seconds).  Now we can have up to three
MM> transaction groups "in progress" at any time: open context, quiesce
MM> context, and sync context.  As a final wrinkle, we also don't allow more
MM> than 1/2 the ARC to be composed of dirty write data.  All taken
MM> together, this means that there can be up to 6GB of writes "in the pipe"
MM> (using the 12GB ARC example from above).
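The arithmetic above (1/4 of the ARC per transaction group, at most 1/2 of the ARC dirty overall) can be sketched as a toy model. This is not ZFS source, just the limits from the example restated as code:

```python
GB = 1 << 30

def old_throttle_limits(arc_size_bytes):
    """Old-model limits: cut a txg at 1/4 ARC of dirty data;
    cap total dirty write data at 1/2 of the ARC."""
    per_txg_limit = arc_size_bytes // 4
    total_dirty_cap = arc_size_bytes // 2
    return per_txg_limit, total_dirty_cap

# The 16GB machine / 12GB ARC example from above:
per_txg, total = old_throttle_limits(12 * GB)
print(per_txg // GB, total // GB)  # 3 6
```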

MM> Problems with this design start to show up when the write-to-disk
MM> bandwidth can't keep up with the application: if the application is
MM> writing at a rate of, say, 1GB/sec, it will "fill the pipe" within
MM> 6 seconds.  But if the IO bandwidth to disk is only 512MB/sec, it's
MM> going to take 12sec to get this data onto the disk.  This "impedance
MM> mis-match" is going to manifest as pauses:  the application fills
MM> the pipe, then waits for the pipe to empty, then starts writing again.
MM> Note that this won't be smooth, since we need to complete an entire
MM> sync phase before allowing things to progress.  So you can end up
MM> with IO gaps.  This is probably what the original submitter is
MM> experiencing.  Note there are a few other subtleties here that I
MM> have glossed over, but the general picture is accurate.
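The fill-then-stall pattern described above follows directly from the two rates in the example. A minimal model (toy code, not ZFS internals) of those numbers:

```python
def pipe_times(pipe_gb, app_gb_per_s, disk_gb_per_s):
    """Seconds for the application to fill the write pipe,
    and seconds for the disks to drain (sync) it."""
    return pipe_gb / app_gb_per_s, pipe_gb / disk_gb_per_s

# 6 GB pipe, 1 GB/s application, 512 MB/s of disk bandwidth:
fill_s, drain_s = pipe_times(6, 1.0, 0.5)
print(fill_s, drain_s)  # 6.0 12.0  -> the app stalls while the pipe drains
```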

MM> The new write throttle code put back into build 87 attempts to
MM> smooth out the process.  We now measure the amount of time it takes
MM> to sync each transaction group, and the amount of data in that group.
MM> We dynamically resize our write throttle to try to keep the sync
MM> time constant (at 5secs) under write load.  We also introduce
MM> "fairness" delays on writers when we near pipeline capacity: each
MM> write is delayed 1/100sec when we are about to "fill up".  This
MM> prevents a single heavy writer from "starving out" occasional
MM> writers.  So instead of coming to an abrupt halt when the pipeline
MM> fills, we slow down our write pace.  The result should be a constant
MM> even IO load.
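The two new mechanisms — resizing the write limit to target a 5-second sync, and the 1/100-second fairness delay near capacity — might be sketched roughly as follows. All names and the 90% "nearly full" threshold are invented for illustration; this is not the build 87 implementation:

```python
TARGET_SYNC_SECS = 5.0
WRITER_DELAY_SECS = 1.0 / 100  # fairness delay per write near capacity

def resize_write_limit(prev_txg_bytes, prev_sync_secs):
    """Scale the per-txg write limit so the next sync takes ~5 seconds,
    based on the measured bandwidth of the last sync."""
    measured_bw = prev_txg_bytes / prev_sync_secs
    return measured_bw * TARGET_SYNC_SECS

def writer_delay(dirty_bytes, write_limit_bytes, near_full_frac=0.9):
    """Delay each write by 1/100 sec once the pipeline is nearly full,
    so a heavy writer cannot starve out occasional writers."""
    if dirty_bytes >= near_full_frac * write_limit_bytes:
        return WRITER_DELAY_SECS
    return 0.0
```

The key design point is the feedback loop: if the last sync took longer than 5 seconds, the limit shrinks; if it was faster, the limit grows, so the throttle converges on the pool's actual bandwidth.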

MM> There is one "down side" to this new model: if a write load is very
MM> "bursty", e.g., a large 5GB write followed by 30secs of idle, the
MM> new code may be less efficient than the old.  In the old code, all
MM> of this IO would be let in at memory speed and then more slowly make
MM> its way out to disk.  In the new code, the writes may be slowed down.
MM> The data makes its way to the disk in the same amount of time, but
MM> the application takes longer.  Conceptually: we are sizing the write
MM> buffer to the pool bandwidth, rather than to the memory size.
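The bursty-write trade-off can be made concrete with assumed numbers (the mail gives only the 5 GB burst size; the disk and memory rates below are illustrative, not from the mail):

```python
burst_gb = 5.0          # the 5 GB burst from the example
disk_gb_per_s = 0.5     # assumed pool bandwidth
mem_gb_per_s = 10.0     # assumed memory-speed absorption rate

old_app_secs = burst_gb / mem_gb_per_s    # old model: burst absorbed at memory speed
new_app_secs = burst_gb / disk_gb_per_s   # new model: writer paced to pool bandwidth
on_disk_secs = burst_gb / disk_gb_per_s   # data reaches disk in the same time either way
print(old_app_secs, new_app_secs, on_disk_secs)  # 0.5 10.0 10.0
```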



First - thank you for your explanation - it is very helpful.

I'm worried about the last part - though it's hard to be optimal for
all workloads. Still, it can be a problem when you change the behavior
from the application's perspective. With other file systems, I guess,
you can fill most of memory and still keep the disks busy 100% of the
time, without IO gaps.

My biggest concern was these gaps in IO, as ZFS should keep the disks
100% busy when needed.



-- 
Best regards,
 Robert Milkowski                           mailto:[EMAIL PROTECTED]
                                       http://milek.blogspot.com

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
