Hello Mark,

Tuesday, April 15, 2008, 8:32:32 PM, you wrote:
MM> ZFS has always done a certain amount of "write throttling". In the past
MM> (or the present, for those of you running S10 or pre build 87 bits) this
MM> throttling was controlled by a timer and the size of the ARC: we would
MM> "cut" a transaction group every 5 seconds based off of our timer, and
MM> we would also "cut" a transaction group if we had more than 1/4 of the
MM> ARC size worth of dirty data in the transaction group. So, for example,
MM> if you have a machine with 16GB of physical memory it wouldn't be
MM> unusual to see an ARC size of around 12GB. This means we would allow
MM> up to 3GB of dirty data into a single transaction group (if the writes
MM> complete in less than 5 seconds). Now we can have up to three
MM> transaction groups "in progress" at any time: open context, quiesce
MM> context, and sync context. As a final wrinkle, we also don't allow more
MM> than 1/2 the ARC to be composed of dirty write data. All taken
MM> together, this means that there can be up to 6GB of writes "in the pipe"
MM> (using the 12GB ARC example from above).
MM>
MM> Problems with this design start to show up when the write-to-disk
MM> bandwidth can't keep up with the application: if the application is
MM> writing at a rate of, say, 1GB/sec, it will "fill the pipe" within
MM> 6 seconds. But if the IO bandwidth to disk is only 512MB/sec, it's
MM> going to take 12 seconds to get this data onto the disk. This "impedance
MM> mismatch" is going to manifest as pauses: the application fills
MM> the pipe, then waits for the pipe to empty, then starts writing again.
MM> Note that this won't be smooth, since we need to complete an entire
MM> sync phase before allowing things to progress. So you can end up
MM> with IO gaps. This is probably what the original submitter is
MM> experiencing. Note there are a few other subtleties here that I
MM> have glossed over, but the general picture is accurate.
MM>
MM> The new write throttle code put back into build 87 attempts to
MM> smooth out the process. We now measure the amount of time it takes
MM> to sync each transaction group, and the amount of data in that group.
MM> We dynamically resize our write throttle to try to keep the sync
MM> time constant (at 5 seconds) under write load. We also introduce
MM> "fairness" delays on writers when we near pipeline capacity: each
MM> write is delayed 1/100 sec when we are about to "fill up". This
MM> prevents a single heavy writer from "starving out" occasional
MM> writers. So instead of coming to an abrupt halt when the pipeline
MM> fills, we slow down our write pace. The result should be a constant,
MM> even IO load.
MM>
MM> There is one "downside" to this new model: if a write load is very
MM> "bursty", e.g., a large 5GB write followed by 30 seconds of idle, the
MM> new code may be less efficient than the old. In the old code, all
MM> of this IO would be let in at memory speed and then more slowly make
MM> its way out to disk. In the new code, the writes may be slowed down.
MM> The data makes its way to the disk in the same amount of time, but
MM> the application takes longer. Conceptually: we are sizing the write
MM> buffer to the pool bandwidth, rather than to the memory size.

First - thank you for your explanation - it is very helpful.

I'm worried about the last part, although I understand it's hard to be optimal for all workloads. Still, changing the behavior from the application's perspective can itself be a problem: with other file systems I guess you are able to fill most of memory and still keep the disks 100% busy, without IO gaps.
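Just to make sure I follow the arithmetic in the old model, here is a quick Python sketch of the pipe sizes and timescales from your example. The 12GB ARC, 1GB/sec application and 512MB/sec pool figures are just your illustrative numbers, and the 2.5GB target for the new throttle is only my reading of "sizing the write buffer to the pool bandwidth", not anything taken from the actual code:

    # Back-of-the-envelope numbers for the old write throttle, using the
    # 12GB ARC example above.  Only a sketch to check the math, not ZFS code.

    arc_size   = 12.0              # GB of ARC on a 16GB machine
    txg_limit  = arc_size / 4      # dirty data allowed per transaction group -> 3 GB
    pipe_limit = arc_size / 2      # total dirty write data allowed "in the pipe" -> 6 GB
    app_rate   = 1.0               # GB/s the application writes
    pool_rate  = 0.5               # GB/s the pool can actually sync to disk

    fill_time  = pipe_limit / app_rate    # pipe fills in 6 s at application speed
    drain_time = pipe_limit / pool_rate   # but needs 12 s to reach the disks

    print(f"up to {txg_limit:.0f} GB per txg, {pipe_limit:.0f} GB in the pipe")
    print(f"pipe fills in {fill_time:.0f}s but drains in {drain_time:.0f}s, "
          f"so the writer alternates between memory speed and stalling "
          f"behind a full sync")

    # My reading of the build 87 throttle: size the pipe to the measured pool
    # bandwidth so a sync takes ~5 seconds, and add 1/100 s delays per write
    # as the pipe approaches that limit instead of stopping writers abruptly.
    new_pipe = pool_rate * 5.0
    print(f"new throttle would target roughly {new_pipe:.1f} GB of dirty data")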
My biggest concern was these gaps in IO, since ZFS should be able to keep the disks 100% busy when needed.

--
Best regards,
Robert Milkowski
mailto:[EMAIL PROTECTED]
http://milek.blogspot.com