ZFS has always done a certain amount of "write throttling". In the past (or the present, for those of you running S10 or pre-build-87 bits) this throttling was controlled by a timer and the size of the ARC: we would "cut" a transaction group every 5 seconds based on a timer, and we would also "cut" a transaction group if we accumulated more than 1/4 of the ARC size worth of dirty data in it. So, for example, if you have a machine with 16GB of physical memory, it wouldn't be unusual to see an ARC size of around 12GB. This means we would allow up to 3GB of dirty data into a single transaction group (if the writes complete in less than 5 seconds). Now, we can have up to three transaction groups "in progress" at any time: one in open context, one in quiesce context, and one in sync context. As a final wrinkle, we also don't allow more than 1/2 the ARC to be composed of dirty write data. Taken together, this means that there can be up to 6GB of writes "in the pipe" (using the 12GB ARC example from above).
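To make the arithmetic concrete, here is a toy C program that works out those limits for the 12GB ARC example. This is not ZFS source; the variable names and layout are made up purely to illustrate the 1/4-ARC and 1/2-ARC rules described above.

	/*
	 * Toy arithmetic for the OLD throttle model.  Illustrative only,
	 * not the actual ZFS kernel code.
	 */
	#include <stdio.h>
	#include <stdint.h>

	#define	GB	(1024ULL * 1024 * 1024)

	int
	main(void)
	{
		uint64_t arc_size = 12 * GB;	/* e.g. 12GB ARC on a 16GB box */

		/* A txg is "cut" once it holds more than 1/4 of the ARC. */
		uint64_t txg_dirty_limit = arc_size / 4;

		/*
		 * Even with three txgs in flight (open, quiescing, syncing),
		 * total dirty data is capped at 1/2 of the ARC.
		 */
		uint64_t total_dirty_cap = arc_size / 2;

		printf("per-txg dirty limit: %llu GB\n",
		    (unsigned long long)(txg_dirty_limit / GB));
		printf("total dirty cap:     %llu GB\n",
		    (unsigned long long)(total_dirty_cap / GB));
		return (0);
	}

For the 12GB example this prints a 3GB per-txg limit and a 6GB overall dirty cap, which is the "pipe" referred to above.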
Problems with this design start to show up when the write-to-disk bandwidth can't keep up with the application: if the application is writing at a rate of, say, 1GB/sec, it will "fill the pipe" within 6 seconds. But if the IO bandwidth to disk is only 512MB/sec, it's going to take 12 seconds to get this data onto the disk. This "impedance mismatch" manifests as pauses: the application fills the pipe, then waits for the pipe to empty, then starts writing again. Note that this won't be smooth, since we need to complete an entire sync phase before allowing things to progress, so you can end up with IO gaps. This is probably what the original submitter is experiencing. There are a few other subtleties here that I have glossed over, but the general picture is accurate.

The new write throttle code that went back into build 87 attempts to smooth out this process. We now measure the amount of time it takes to sync each transaction group and the amount of data in that group, and we dynamically resize our write throttle to try to keep the sync time constant (at 5 seconds) under write load. We also introduce "fairness" delays on writers when we near pipeline capacity: each write is delayed 1/100 sec when we are about to "fill up". This prevents a single heavy writer from "starving out" occasional writers. So instead of coming to an abrupt halt when the pipeline fills, we slow down our write pace, and the result should be a constant, even IO load. (A rough sketch of this feedback loop follows below the quoted thread.)

There is one "down side" to this new model: if a write load is very "bursty", e.g., a large 5GB write followed by 30 seconds of idle, the new code may be less efficient than the old. In the old code, all of this IO would be let in at memory speed and then more slowly make its way out to disk. In the new code, the writes may be slowed down. The data makes its way to the disk in the same amount of time, but the application takes longer. Conceptually: we are sizing the write buffer to the pool bandwidth, rather than to the memory size.

Robert Milkowski wrote:
> Hello eric,
>
> Thursday, March 27, 2008, 9:36:42 PM, you wrote:
>
> ek> On Mar 27, 2008, at 9:24 AM, Bob Friesenhahn wrote:
>>> On Thu, 27 Mar 2008, Neelakanth Nadgir wrote:
>>>> This causes the sync to happen much faster, but as you say,
>>>> suboptimal.
>>>> Haven't had the time to go through the bug report, but probably
>>>> CR 6429205 each zpool needs to monitor its throughput
>>>> and throttle heavy writers
>>>> will help.
>>> I hope that this feature is implemented soon, and works well. :-)
>
> ek> Actually, this has gone back into snv_87 (and no we don't know which
> ek> s10uX it will go into yet).
>
> Could you share more details how it works right now after change?
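As promised above, here is a toy sketch of the new throttle's shape. The 5-second sync target and the 1/100-sec fairness delay are from the description above; everything else (the function names, the "resize to measured bandwidth times 5 seconds" policy, and the 90% trigger point) is my own guess for illustration and is not the snv_87 source.

	/*
	 * Rough sketch of the NEW throttle idea: size the dirty data we
	 * accept per txg to the measured pool bandwidth so each sync takes
	 * roughly 5 seconds, and delay writers slightly as we near the limit.
	 * All names and the exact adjustment policy are illustrative guesses.
	 */
	#include <stdio.h>
	#include <stdint.h>
	#include <unistd.h>

	#define	MB		(1024ULL * 1024)
	#define	TXG_TARGET_SECS	5		/* desired sync time per txg */
	#define	WRITE_DELAY_US	10000		/* 1/100 sec fairness delay */

	static uint64_t txg_dirty_limit = 512 * MB;	/* current throttle */
	static uint64_t txg_dirty = 0;			/* dirty data in open txg */

	/* After each sync, rescale the throttle toward the 5-second target. */
	static void
	txg_sync_done(uint64_t synced_bytes, double sync_secs)
	{
		double bw = synced_bytes / sync_secs;	/* observed pool bandwidth */

		txg_dirty_limit = (uint64_t)(bw * TXG_TARGET_SECS);
		txg_dirty = 0;
	}

	/* Called on every write: delay the writer when we are nearly full. */
	static void
	write_throttle(uint64_t nbytes)
	{
		if (txg_dirty + nbytes > (txg_dirty_limit * 9) / 10)
			usleep(WRITE_DELAY_US);		/* slow down, don't stop */
		txg_dirty += nbytes;
	}

	int
	main(void)
	{
		/* Pretend the last txg synced 2.5GB in 10 seconds (256MB/s). */
		txg_sync_done(2560 * MB, 10.0);
		printf("new per-txg limit: %llu MB\n",
		    (unsigned long long)(txg_dirty_limit / MB));

		/* A writer arriving near the limit picks up the 10ms delay. */
		txg_dirty = (txg_dirty_limit * 95) / 100;
		write_throttle(1 * MB);
		printf("dirty after throttled write: %llu MB\n",
		    (unsigned long long)(txg_dirty / MB));
		return (0);
	}

The point of the sketch is the feedback loop: a slow pool shrinks the per-txg limit, so writers are paced at roughly the pool's bandwidth instead of memory speed, and the small per-write delay near capacity keeps one heavy writer from starving everyone else.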