ZFS has always done a certain amount of "write throttling".  In the past
(or the present, for those of you running S10 or pre-build-87 bits) this
throttling was controlled by a timer and the size of the ARC: we would
"cut" a transaction group every 5 seconds based on that timer, and
we would also "cut" a transaction group if we had more than 1/4 of the
ARC size worth of dirty data in the transaction group.  So, for example,
if you have a machine with 16GB of physical memory, it wouldn't be
unusual to see an ARC size of around 12GB.  This means we would allow
up to 3GB of dirty data into a single transaction group (if the writes
complete in less than 5 seconds).  Now we can have up to three
transaction groups "in progress" at any time: open context, quiesce
context, and sync context.  As a final wrinkle, we also don't allow more
than 1/2 the ARC to be composed of dirty write data.  All taken
together, this means that there can be up to 6GB of writes "in the pipe"
(using the 12GB ARC example from above).
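
To make the arithmetic concrete, here is a tiny sketch (this is not the
actual ZFS code, just the 12GB-ARC example restated; the variable names
are made up):

    #include <stdio.h>

    /*
     * Illustrative only: the old throttle capped each transaction group
     * at 1/4 of the ARC, allowed up to three txgs in flight, but never
     * let dirty data exceed 1/2 of the ARC overall.
     */
    int
    main(void)
    {
            unsigned long long arc_bytes = 12ULL << 30;    /* ~12GB ARC */
            unsigned long long per_txg   = arc_bytes / 4;  /* per-txg cap */
            unsigned long long dirty_cap = arc_bytes / 2;  /* overall cap */

            printf("per-txg dirty limit:  %llu GB\n", per_txg >> 30);
            printf("max dirty in flight:  %llu GB\n", dirty_cap >> 30);
            return (0);
    }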

Problems with this design start to show up when the write-to-disk
bandwidth can't keep up with the application: if the application is
writing at a rate of, say, 1GB/sec, it will "fill the pipe" within
6 seconds.  But if the IO bandwidth to disk is only 512MB/sec, it's
going to take 12 seconds to get this data onto the disk.  This "impedance
mismatch" is going to manifest as pauses:  the application fills
the pipe, then waits for the pipe to empty, then starts writing again.
Note that this won't be smooth, since we need to complete an entire
sync phase before allowing things to progress.  So you can end up
with IO gaps.  This is probably what the original submitter is
experiencing.  Note there are a few other subtleties here that I
have glossed over, but the general picture is accurate.
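
Here is a toy second-by-second simulation of that fill/stall cycle
(illustrative only, not ZFS code; it just uses the 6GB pipe and the
rates from the example above, and ignores the subtleties mentioned):

    #include <stdio.h>

    /*
     * An application writing at 1GB/sec into a 6GB "pipe" that drains
     * to disk at 512MB/sec.  Once the pipe fills, the application
     * stalls until the pipe has emptied, which shows up as gaps in
     * the write stream.
     */
    int
    main(void)
    {
            double pipe_gb  = 6.0;    /* dirty-data limit from above */
            double app_gbs  = 1.0;    /* application write rate */
            double disk_gbs = 0.5;    /* pool write bandwidth */
            double dirty = 0.0;
            int stalled = 0;

            for (int t = 0; t < 40; t++) {
                    double wrote = 0.0;

                    if (stalled && dirty <= 0.0)
                            stalled = 0;            /* pipe drained; resume */
                    if (!stalled) {
                            wrote = app_gbs;        /* app writes this second */
                            dirty += wrote;
                            if (dirty >= pipe_gb)
                                    stalled = 1;    /* pipe full; app pauses */
                    }
                    dirty -= disk_gbs;              /* disk drains continuously */
                    if (dirty < 0.0)
                            dirty = 0.0;
                    printf("t=%2ds  app wrote %.1fGB  dirty %.1fGB%s\n",
                        t, wrote, dirty, stalled ? "  (stalled)" : "");
            }
            return (0);
    }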

The new write throttle code put back into build 87 attempts to
smooth out the process.  We now measure the amount of time it takes
to sync each transaction group, and the amount of data in that group.
We dynamically resize our write throttle to try to keep the sync
time constant (at 5 seconds) under write load.  We also introduce
"fairness" delays on writers when we near pipeline capacity: each
write is delayed 1/100 of a second when we are about to "fill up".  This
prevents a single heavy writer from "starving out" occasional
writers.  So instead of coming to an abrupt halt when the pipeline
fills, we slow down our write pace.  The result should be a constant,
even IO load.
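
To give a feel for the mechanism, here is a rough sketch of the idea in
C.  This is not the actual code that went into snv_87; the function
names, the 95% threshold, and the bookkeeping are made up purely for
illustration:

    #include <stdio.h>
    #include <unistd.h>

    /*
     * Illustrative sketch only -- not ZFS code.  The idea: after each
     * txg sync, estimate the pool's write bandwidth from
     * (bytes synced / sync time) and size the dirty-data limit so that
     * a full txg takes about TARGET_SYNC_SEC to sync.  Writers that
     * arrive when the limit is nearly reached are each delayed a bit.
     */
    #define TARGET_SYNC_SEC   5.0
    #define THROTTLE_DELAY_US 10000      /* 1/100 of a second */

    static double dirty_limit = 3.0e9;   /* bytes; starting guess */
    static double dirty_bytes = 0.0;     /* dirty data in the open txg */

    /* Called when a txg finishes syncing. */
    static void
    txg_sync_done(double bytes_synced, double sync_seconds)
    {
            double bandwidth = bytes_synced / sync_seconds;   /* B/s */

            /* Resize the throttle toward "5 seconds worth" of writes. */
            dirty_limit = bandwidth * TARGET_SYNC_SEC;
            printf("new dirty limit: %.1f GB\n", dirty_limit / 1e9);
    }

    /* Called on each application write. */
    static void
    throttle_write(double bytes)
    {
            /* Near capacity?  Delay the writer a little for fairness. */
            if (dirty_bytes + bytes > dirty_limit * 0.95)
                    usleep(THROTTLE_DELAY_US);
            dirty_bytes += bytes;
    }

    int
    main(void)
    {
            txg_sync_done(2.5e9, 5.0);    /* pool sank 2.5GB in 5 sec */
            throttle_write(1 << 20);      /* a 1MB write */
            return (0);
    }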

There is one "down side" to this new model: if a write load is very
"bursty", e.g., a large 5GB write followed by 30secs of idle, the
new code may be less efficient than the old.  In the old code, all
of this IO would be let in at memory speed and then more slowly make
its way out to disk.  In the new code, the writes may be slowed down.
The data makes its way to the disk in the same amount of time, but
the application takes longer.  Conceptually: we are sizing the write
buffer to the pool bandwidth, rather than to the memory size.
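
Rough numbers for that bursty case (illustrative only, reusing the
512MB/sec pool figure from the earlier example and ignoring overlap):

    #include <stdio.h>

    /*
     * A 5GB burst against a 512MB/sec pool.  Old model: the burst lands
     * in memory almost instantly and the application moves on while the
     * pool drains it.  New model: the writers are paced near pool
     * bandwidth, so the application itself is busy for the drain time.
     */
    int
    main(void)
    {
            double burst_gb = 5.0;
            double pool_gbs = 0.5;
            double drain = burst_gb / pool_gbs;

            printf("time for data to reach disk: ~%.0f sec (both models)\n",
                drain);
            printf("time the application is busy: old ~0 sec, new ~%.0f sec\n",
                drain);
            return (0);
    }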

Robert Milkowski wrote:
> Hello eric,
> 
> Thursday, March 27, 2008, 9:36:42 PM, you wrote:
> 
> ek> On Mar 27, 2008, at 9:24 AM, Bob Friesenhahn wrote:
>>> On Thu, 27 Mar 2008, Neelakanth Nadgir wrote:
>>>> This causes the sync to happen much faster, but as you say,  
>>>> suboptimal.
>>>> Haven't had the time to go through the bug report, but probably
>>>> CR 6429205 each zpool needs to monitor its throughput
>>>> and throttle heavy writers
>>>> will help.
>>> I hope that this feature is implemented soon, and works well. :-)
> 
> ek> Actually, this has gone back into snv_87 (and no we don't know which  
> ek> s10uX it will go into yet).
> 
> 
> Could you share more details how it works right now after change?
> 