Thanks for your reply. I disabled write throttling, but didn't observe any change in behavior. After doing some more research, I have a theory as to the root cause of the pauses that I'm observing.
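(For reference, the knob for the old write throttle on builds of this vintage is, as far as I know, the zfs_no_write_throttle tunable; treat the exact name as my assumption. It can be set in /etc/system:

    set zfs:zfs_no_write_throttle = 1

or flipped on a live system with mdb:

    echo zfs_no_write_throttle/W 1 | mdb -kw

Either way, the pauses persisted for me.)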
Near the end of spa_sync, writes are blocked in function zil_itx_assign, as illustrated by the following lockstat output:

Adaptive mutex block: 179 events in 5.015 seconds (36 events/sec)

Count indv cuml rcnt     nsec Hottest Lock           Hottest Caller
-------------------------------------------------------------------------------
    3 100% 100% 0.00 178617192 0xffffffff82a7e4c0     zil_itx_assign+0x22

This function is blocked for 178ms while attempting to acquire a lock on the ZFS intent log. The function holding the lock is zil_itx_clean, as illustrated by the following lockstat output:

Adaptive mutex hold: 146357 events in 5.059 seconds (28927 events/sec)

Count indv cuml rcnt     nsec Lock                   Caller
-------------------------------------------------------------------------------
    1   0% 100% 0.00 178438696 0xffffffff82a7e4c0     zil_itx_clean+0xd1

Since zil_itx_clean holds the intent-log lock for 178ms, no new writes can be performed during this time.

Looking into the source, it appears that zil_itx_clean obtains the lock on the intent log and then enters a while loop, moving the already-sync'd transactions onto another list so that they can be freed. Here's a comment from the code within the synchronized block:

 * Move the sync'd log transactions to a separate list so we can call
 * kmem_free without holding the zl_lock.

So it appears that sync'ing the transactions to disk isn't what causes the delays; the cleanup after the sync is the problem. This cleanup holds the intent-log lock while old/sync'd transactions are moved out of the intent log, and new ZFS writes are blocked for the duration. At least, that's my theory. (A sketch of the pattern I'm describing follows the quoted thread below.)

On Fri, Feb 26, 2010 at 11:30 PM, Zhu Han <schumi....@gmail.com> wrote:
> Hi,
>
> This page may indicate the root cause:
> http://blogs.sun.com/roch/entry/the_new_zfs_write_throttle
>
> ZFS throttles writes to match the rate at which data flows into the txg
> to the speed of the disk I/O. If it detects that the modest measure (a
> 1-tick pause) cannot prevent the tx group from growing too large, it
> falls back to stalling all write requests. That could be the situation
> you have observed.
>
> However, please note that this may not be correct, since I'm not a ZFS
> developer.
>
> As a workaround, you may add more disks to the ZFS pool to get more
> bandwidth and alleviate the problem. Or you may want to disable write
> throttling, if you are sure the writes only come in bursts. Again, I'm
> not sure whether the latter solution is feasible.
>
> best regards,
> hanzhu
>
>
> On Sat, Feb 27, 2010 at 2:29 AM, Bob Friesenhahn
> <bfrie...@simple.dallas.tx.us> wrote:
>
>> On Fri, 26 Feb 2010, Shane Cox wrote:
>>
>>> I've reviewed the forum archives and read a number of threads related
>>> to this issue. However, I didn't find a root-cause explanation for
>>> these pauses, only talk of how to ameliorate them. In my particular
>>> case, I would like to know why zfs log writes are blocked for 180ms on
>>> a mutex (seemingly blocked on the intent log itself) when performing
>>> zil_itx_assign. Another thread must have a lock on the intent log, no?
>>> Overall, the system appears healthy, as other system calls (e.g.,
>>> reads and writes to network devices) complete successfully while
>>> writes to the intent log are blocked ... so the problem seems to be
>>> access to the ZFS intent log.
>>> Any additional insight would be appreciated.
>>
>> As far as I am aware, none of the zfs authors has been willing to
>> address this issue in public. It is not clear (to me) if the
>> fundamental design of zfs transaction groups requires that writes stop
>> briefly until the transaction group has been flushed to disk. I suspect
>> that this is the case.
>>
>> Perhaps zfs will never meet your timing requirements. Others here have
>> had considerable success by using RAID interface adaptor cards with
>> battery-backed cache memory and configuring those cards to "IT" JBOD
>> mode. By limiting the TXG group size to the amount which will fit in
>> battery-backed cache memory, the time to "commit" the TXG group is
>> dramatically reduced, as long as the continual write rate does not
>> exceed what the backing disks can sustain. Unfortunately, this may
>> increase the total amount of data written to underlying storage.
>>
>> Bob
>> --
>> Bob Friesenhahn
>> bfrie...@simple.dallas.tx.us
>> http://www.simplesystems.org/users/bfriesen/
>> GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
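To make my theory concrete, here is a minimal user-space sketch of the pattern I believe zil_itx_clean follows: unlink the already-sync'd transactions onto a private list while holding the lock, then free them after dropping it. This is a paraphrase, not the actual ZFS code; zl_lock comes from the source comment quoted above, but the other names (itx_t fields, last_synced_txg) are stand-ins, and a pthread mutex stands in for the kernel mutex.

    /*
     * Sketch of the move-then-free pattern: hold the list lock only
     * long enough to unlink sync'd records onto a private list, then
     * free them after dropping the lock.  Names are stand-ins.
     */
    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct itx {
        struct itx *next;
        uint64_t    txg;    /* txg this log record belongs to */
    } itx_t;

    static pthread_mutex_t zl_lock = PTHREAD_MUTEX_INITIALIZER;
    static itx_t *zl_itx_list;      /* protected by zl_lock */

    static void
    itx_clean(uint64_t last_synced_txg)
    {
        itx_t *itx, **prevp, *clean_list = NULL;

        pthread_mutex_lock(&zl_lock);
        /*
         * Move the sync'd log transactions to a separate list so we
         * can free them without holding zl_lock.  Note the walk itself
         * still runs with the lock held, so a long list means a long
         * hold time -- the stall described above.
         */
        prevp = &zl_itx_list;
        while ((itx = *prevp) != NULL) {
            if (itx->txg <= last_synced_txg) {
                *prevp = itx->next;         /* unlink from intent log */
                itx->next = clean_list;
                clean_list = itx;           /* onto the private list */
            } else {
                prevp = &itx->next;
            }
        }
        pthread_mutex_unlock(&zl_lock);

        /* Free outside the lock; new writers can assign again. */
        while ((itx = clean_list) != NULL) {
            clean_list = itx->next;
            free(itx);
        }
    }

    int
    main(void)
    {
        /* Populate a few records across two txgs, then "clean" txg 1. */
        for (uint64_t txg = 1; txg <= 2; txg++) {
            for (int i = 0; i < 3; i++) {
                itx_t *itx = calloc(1, sizeof (*itx));
                itx->txg = txg;
                itx->next = zl_itx_list;
                zl_itx_list = itx;
            }
        }
        itx_clean(1);
        for (itx_t *itx = zl_itx_list; itx != NULL; itx = itx->next)
            printf("remaining itx txg=%llu\n", (unsigned long long)itx->txg);
        return (0);
    }

If this reading is right, then even though the free itself happens outside the lock, the unlink walk runs with the lock held, so the hold time grows with the number of itx records accumulated during the txg. That would explain why the pause scales with the size of the write burst.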