Thanks for your reply.  I disabled write throttling, but didn't observe any
change in behavior.  After doing some more research, I have a theory as to
the root cause of the pauses that I'm observing.


Near the end of spa_sync, writes are blocked in the function zil_itx_assign,
as illustrated by the following lockstat output:

Adaptive mutex block: 179 events in 5.015 seconds (36 events/sec)
Count indv cuml rcnt     nsec Hottest Lock           Hottest Caller

-------------------------------------------------------------------------------
    3 100% 100% 0.00 178617192 0xffffffff82a7e4c0     zil_itx_assign+0x22


Threads calling this function block for roughly 178ms while attempting to
acquire the mutex protecting the zfs intent log.
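
For context, here's roughly what zil_itx_assign does -- a simplified
paraphrase of usr/src/uts/common/fs/zfs/zil.c from memory, not the exact
code.  It must take zl_lock just to append the new itx to the in-memory
list, so it stalls whenever another thread holds that mutex:

/*
 * Simplified paraphrase of zil_itx_assign() -- from memory of the
 * onnv sources, not verbatim.
 */
uint64_t
zil_itx_assign(zilog_t *zilog, itx_t *itx, dmu_tx_t *tx)
{
        uint64_t seq;

        mutex_enter(&zilog->zl_lock);   /* <-- the 178ms wait is here */
        list_insert_tail(&zilog->zl_itx_list, itx);
        itx->itx_lr.lrc_txg = dmu_tx_get_txg(tx);
        itx->itx_lr.lrc_seq = seq = ++zilog->zl_itx_seq;
        mutex_exit(&zilog->zl_lock);

        return (seq);
}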


The function holding the lock is zil_itx_clean as illustrated by the
following lockstat output:

Adaptive mutex hold: 146357 events in 5.059 seconds (28927 events/sec)
Count indv cuml rcnt     nsec Lock                   Caller

    1   0% 100% 0.00 178438696 0xffffffff82a7e4c0     zil_itx_clean+0xd1


Since zil_itx_clean holds a lock on the zfs intent log for 178ms, no new
writes can be performed during this time.


Looking into the source, it appears that zil_itx_clean obtains the lock on
the zfs intent log, then enters a while loop, moving the already sync'd
transactions into another list so that they can be freed.  Here's a comment
from the code within the synchronized block:
* Move the sync'd log transactions to a separate list so we can call
* kmem_free without holding the zl_lock.
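
Putting that together, the function looks roughly like this -- again a
simplified paraphrase from memory, with the zl_writer/condvar handling
elided.  Note that the entire list walk runs with zl_lock held, and only
the kmem_free calls happen outside it:

/*
 * Simplified paraphrase of zil_itx_clean() -- not verbatim;
 * zl_writer synchronization elided.
 */
void
zil_itx_clean(zilog_t *zilog)
{
        itx_t *itx;
        list_t clean_list;

        list_create(&clean_list, sizeof (itx_t),
            offsetof(itx_t, itx_node));

        mutex_enter(&zilog->zl_lock);
        /*
         * Move the sync'd log transactions to a separate list so we
         * can call kmem_free without holding the zl_lock.
         */
        while ((itx = list_head(&zilog->zl_itx_list)) != NULL &&
            itx->itx_lr.lrc_txg <= spa_last_synced_txg(zilog->zl_spa)) {
                list_remove(&zilog->zl_itx_list, itx);
                list_insert_tail(&clean_list, itx);
        }
        mutex_exit(&zilog->zl_lock);    /* held for the entire walk above */

        /* the frees themselves happen with the lock dropped */
        while ((itx = list_head(&clean_list)) != NULL) {
                list_remove(&clean_list, itx);
                kmem_free(itx, offsetof(itx_t, itx_lr) +
                    itx->itx_lr.lrc_reclen);
        }
        list_destroy(&clean_list);
}

So with a large backlog of sync'd itxs, the list walk itself is what holds
zl_lock for the 178ms, even though the frees are deliberately done outside
the lock.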


So it appears that sync'ing the transactions to disk isn't what causes the
delays; the cleanup after the sync is the problem.  That cleanup holds the
lock on the zfs intent log while old, already-sync'd transactions are moved
out of the intent log, and new zfs writes are blocked for the duration.

At least, that's my theory.
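
If anyone wants to double-check this, lockstat can capture call stacks for
the hold events, which should show the full path into zil_itx_clean along
with the hold times.  Something like this (see lockstat(1M); adjust the
stack depth and event count to taste):

# lockstat -H -s 10 -D 20 sleep 5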



On Fri, Feb 26, 2010 at 11:30 PM, Zhu Han <schumi....@gmail.com> wrote:

> Hi,
>
> This page may indicate the root cause.
> http://blogs.sun.com/roch/entry/the_new_zfs_write_throttle
>
> ZFS throttles the write rate to match the speed of the disk I/O.  If it
> detects that the modest measure (a 1-tick pause) cannot keep the tx group
> from growing too large, it falls back to stalling all write requests.
> That could be the situation you have observed.
>
> However, please note that this may not be correct, since I'm not a
> developer of ZFS.
>
> As a workaround, you could add more disks to the ZFS pool to get more
> bandwidth and alleviate the problem.  Or you may want to disable write
> throttling, if you are sure the writes only come in bursts.  Again, I'm
> not sure whether the latter solution is feasible.
>
> best regards,
> hanzhu
>
>
> On Sat, Feb 27, 2010 at 2:29 AM, Bob Friesenhahn <
> bfrie...@simple.dallas.tx.us> wrote:
>
>> On Fri, 26 Feb 2010, Shane Cox wrote:
>>
>>>
>>> I've reviewed the forum archives and read a number of threads
>>> related to this issue.  However, I didn't find a root-cause
>>> explanation for these pauses, only talk of how to ameliorate them.
>>> In my particular case, I would like to know why zfs_log_writes are
>>> blocked for 180ms on a mutex (seemingly blocked on the intent log
>>> itself) when performing zil_itx_assign.  Another thread must have a
>>> lock on the intent log, no?  Overall, the system appears healthy as
>>> other system calls (e.g., reads and writes to network devices)
>>> complete successfully while writes to the intent log are blocked ...
>>> so the problem seems to be access to the zfs intent log.  Any
>>> additional insight would be appreciated.
>>>
>>
>> As far as I am aware, none of the zfs authors has been willing to address
>> this issue in public.  It is not clear (to me) if the fundamental design of
>> zfs transaction groups requires that writes stop briefly until the
>> transaction group has been flushed to disk.  I suspect that this is the
>> case.
>>
>> Perhaps zfs will never meet your timing requirements.  Others here have
>> had considerable success by using RAID interface adaptor cards with
>> battery-backed cache memory and configuring those cards to "IT" JBOD mode.
>>  By limiting the TXG group size to the amount which will fit in
>> battery-backed cache memory, the time to "commit" the TXG group is
>> dramatically reduced as long as the continual write rate does not exceed
>> what the backing disks can sustain.  Unfortunately, this may increase the
>> total amount of data written to underlying storage.
>>
>>
>> Bob
>> --
>> Bob Friesenhahn
>> bfrie...@simple.dallas.tx.us,
>> http://www.simplesystems.org/users/bfriesen/
>> GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
>>
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
