Thanks, I have filed the bug, but I don't know how to provide the crash dump. If the bug is accepted, RE can get the crash dump file from me.
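For the archive, here is a sketch of how I understand a crash dump is normally captured and handed over on OpenSolaris; the paths below are just the common defaults, so check the dumpadm output on the actual machine:

    # Show the current dump configuration (dump device, savecore directory)
    dumpadm

    # Capture a live dump of the running system without panicking it;
    # files land in the savecore directory, typically /var/crash/<hostname>
    savecore -L

    # Expand a compressed vmdump.N into unix.N/vmcore.N so mdb can read it
    savecore -f /var/crash/`hostname`/vmdump.0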
On Thu, Jul 15, 2010 at 10:33 PM, George Wilson <george.r.wil...@oracle.com> wrote:
> I don't recall seeing this issue before. Best thing to do is file a bug and
> include a pointer to the crash dump.
>
> - George
>
> zhihui Chen wrote:
>>
>> It looks like the txg_sync_thread for this pool has blocked and never
>> returned, which has in turn blocked many other threads. I have tried
>> changing zfs_vdev_max_pending from 10 to 35 and retested the workload
>> several times; with that setting this issue does not happen. But if I
>> change it back to 10, it happens very easily. Is this a known bug, and
>> is there any suggestion for solving it?
>>
>>> ffffff0502c3378c::wchaninfo -v
>> ADDR             TYPE NWAITERS   THREAD           PROC
>> ffffff0502c3378c cond     1730:  ffffff051cc6b500 go_filebench
>>                                  ffffff051ce61020 go_filebench
>>                                  ffffff051cc4e4e0 go_filebench
>>                                  ffffff051d115120 go_filebench
>>                                  ffffff051e9ed000 go_filebench
>>                                  ffffff051bf644c0 go_filebench
>>                                  ffffff051c65b000 go_filebench
>>                                  ffffff051c728500 go_filebench
>>                                  ffffff050d83a8c0 go_filebench
>>                                  ffffff051c528c00 go_filebench
>>                                  ffffff051b750800 go_filebench
>>                                  ffffff051cdd7520 go_filebench
>>                                  ffffff051ce71bc0 go_filebench
>>                                  ffffff051cb5e840 go_filebench
>>                                  ffffff051cbdec60 go_filebench
>>                                  ffffff0516473c60 go_filebench
>>                                  ffffff051d132820 go_filebench
>>                                  ffffff051d13a400 go_filebench
>>                                  ffffff050fbf0b40 go_filebench
>>                                  ffffff051ce7a400 go_filebench
>>                                  ffffff051b781820 go_filebench
>>                                  ffffff051ce603e0 go_filebench
>>                                  ffffff051d1bf840 go_filebench
>>                                  ffffff051c6c24c0 go_filebench
>>                                  ffffff051d204100 go_filebench
>>                                  ffffff051cbdf160 go_filebench
>>                                  ffffff051ce52c00 go_filebench
>>                                  .......
>>
>>> ffffff051cc6b500::findstack -v
>> stack pointer for thread ffffff051cc6b500: ffffff0020a76ac0
>> [ ffffff0020a76ac0 _resume_from_idle+0xf1() ]
>>   ffffff0020a76af0 swtch+0x145()
>>   ffffff0020a76b20 cv_wait+0x61(ffffff0502c3378c, ffffff0502c33700)
>>   ffffff0020a76b70 zil_commit+0x67(ffffff0502c33700, 6b255, 14)
>>   ffffff0020a76d80 zfs_write+0xaaf(ffffff050b5c9140, ffffff0020a76e40, 40, ffffff0502dab258, 0)
>>   ffffff0020a76df0 fop_write+0x6b(ffffff050b5c9140, ffffff0020a76e40, 40, ffffff0502dab258, 0)
>>   ffffff0020a76ec0 pwrite64+0x244(1a, b6f2a000, 800, b841a800, 0)
>>   ffffff0020a76f10 sys_syscall32+0xff()
>>
>> From the zil_commit code, I tried to find the thread whose stack
>> contains a call to zil_commit_writer. That thread has not returned
>> from zil_commit_writer, so it never calls cv_broadcast to wake up
>> the waiting threads.
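A quicker way to locate that writer thread than walking 1730 waiter stacks by hand might be mdb's ::stacks dcmd, assuming the b141 mdb has it; it deduplicates threads and can filter by a function name on the stack:

    > ::stacks -c zil_commit_writer

The thread below is the one such a filter should single out.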
>>> ffffff051d10fba0::findstack -v
>> stack pointer for thread ffffff051d10fba0: ffffff0021ab9a10
>> [ ffffff0021ab9a10 _resume_from_idle+0xf1() ]
>>   ffffff0021ab9a40 swtch+0x145()
>>   ffffff0021ab9a70 cv_wait+0x61(ffffff051ae1b988, ffffff051ae1b980)
>>   ffffff0021ab9ab0 zio_wait+0x5d(ffffff051ae1b680)
>>   ffffff0021ab9b20 zil_commit_writer+0x249(ffffff0502c33700, 6b250, e)
>>   ffffff0021ab9b70 zil_commit+0x91(ffffff0502c33700, 6b250, e)
>>   ffffff0021ab9d80 zfs_write+0xaaf(ffffff050b5c9540, ffffff0021ab9e40, 40, ffffff0502dab258, 0)
>>   ffffff0021ab9df0 fop_write+0x6b(ffffff050b5c9540, ffffff0021ab9e40, 40, ffffff0502dab258, 0)
>>   ffffff0021ab9ec0 pwrite64+0x244(14, bfbfb800, 800, 88f3f000, 0)
>>   ffffff0021ab9f10 sys_syscall32+0xff()
>>
>>> ffffff051ae1b680::zio -r
>> ADDRESS          TYPE  STAGE            WAITER
>> ffffff051ae1b680 NULL  CHECKSUM_VERIFY  ffffff051d10fba0
>>  ffffff051a9c1978 WRITE VDEV_IO_START   -
>>  ffffff052454d348 WRITE VDEV_IO_START   -
>>  ffffff051572b960 WRITE VDEV_IO_START   -
>>  ffffff050accb330 WRITE VDEV_IO_START   -
>>  ffffff0514453c80 WRITE VDEV_IO_START   -
>>  ffffff0524537648 WRITE VDEV_IO_START   -
>>  ffffff05090e9660 WRITE VDEV_IO_START   -
>>  ffffff05151cb698 WRITE VDEV_IO_START   -
>>  ffffff0514668658 WRITE VDEV_IO_START   -
>>  ffffff0514835690 WRITE VDEV_IO_START   -
>>  ffffff05198979a0 WRITE VDEV_IO_START   -
>>  ffffff0507e1d038 WRITE VDEV_IO_START   -
>>  ffffff0510727028 WRITE VDEV_IO_START   -
>>  ffffff0523a25018 WRITE VDEV_IO_START   -
>>  ffffff0523d729c0 WRITE VDEV_IO_START   -
>>  ffffff052465b990 WRITE VDEV_IO_START   -
>>  ffffff052395f008 WRITE DONE            -
>>  ffffff0514cbc350 WRITE VDEV_IO_START   -
>>  ffffff05146f2688 WRITE VDEV_IO_START   -
>>  ffffff0509454048 WRITE VDEV_IO_START   -
>>  ffffff0524186038 WRITE VDEV_IO_START   -
>>  ffffff051166e9a0 WRITE DONE            -
>>  ffffff0515256960 WRITE VDEV_IO_START   -
>>  ffffff0518edf010 WRITE VDEV_IO_START   -
>>  ffffff0514b2f688 WRITE VDEV_IO_START   -
>>  ffffff05158b4040 WRITE VDEV_IO_START   -
>>  ffffff052448d648 WRITE DONE            -
>>  ffffff0512354380 WRITE VDEV_IO_START   -
>>  ffffff051aafe6a0 WRITE VDEV_IO_START   -
>>  ffffff051524e350 WRITE VDEV_IO_START   -
>>  ffffff051a707058 WRITE VDEV_IO_START   -
>>  ffffff0524679c88 WRITE DONE            -
>>  ffffff051acef058 WRITE DONE            -
>>
>>> ffffff051acef058::print zio_t io_executor
>> io_executor = 0xffffff002089ac40
>>
>>> 0xffffff002089ac40::findstack -v
>> stack pointer for thread ffffff002089ac40: ffffff002089a720
>> [ ffffff002089a720 _resume_from_idle+0xf1() ]
>>   ffffff002089a750 swtch+0x145()
>>   ffffff002089a800 turnstile_block+0x760(ffffff051d186418, 0, ffffff051fcf0340, fffffffffbc07db8, 0, 0)
>>   ffffff002089a860 mutex_vector_enter+0x261(ffffff051fcf0340)
>>   ffffff002089a890 txg_rele_to_sync+0x2a(ffffff05121bece8)
>>   ffffff002089a8c0 dmu_tx_commit+0xee(ffffff05121bec98)
>>   ffffff002089a8f0 zil_lwb_write_done+0x5f(ffffff051acef058)
>>   ffffff002089a960 zio_done+0x383(ffffff051acef058)
>>   ffffff002089a990 zio_execute+0x8d(ffffff051acef058)
>>   ffffff002089a9f0 zio_notify_parent+0xa6(ffffff051acef058, ffffff052391b9b8, 1)
>>   ffffff002089aa60 zio_done+0x3e2(ffffff052391b9b8)
>>   ffffff002089aa90 zio_execute+0x8d(ffffff052391b9b8)
>>   ffffff002089ab30 taskq_thread+0x248(ffffff050c418910)
>>   ffffff002089ab40 thread_start+8()
>>
>>> ffffff05121bece8::print -t txg_handle_t
>> txg_handle_t {
>>     tx_cpu_t *th_cpu = 0xffffff051fcf0340
>>     uint64_t th_txg = 0xf36
>> }
>>
>>> ffffff051fcf0340::mutex
>>             ADDR  TYPE             HELD MINSPL OLDSPL WAITERS
>> ffffff051fcf0340 adapt ffffff050dc5d3a0      -      -     yes
>>
>>> ffffff050dc5d3a0::findstack -v
>> stack pointer for thread ffffff050dc5d3a0: ffffff0023589970
>> [ ffffff0023589970 _resume_from_idle+0xf1() ]
>>   ffffff00235899a0 swtch+0x145()
>>   ffffff0023589a50 turnstile_block+0x760(ffffff051ce0c948, 0, ffffff05083403c8, fffffffffbc07db8, 0, 0)
>>   ffffff0023589ab0 mutex_vector_enter+0x261(ffffff05083403c8)
>>   ffffff0023589b30 dmu_tx_try_assign+0xab(ffffff0514395018, 2)
>>   ffffff0023589b70 dmu_tx_assign+0x2a(ffffff0514395018, 2)
>>   ffffff0023589d80 zfs_write+0x65f(ffffff050b5c9640, ffffff0023589e40, 40, ffffff0502dab258, 0)
>>   ffffff0023589df0 fop_write+0x6b(ffffff050b5c9640, ffffff0023589e40, 40, ffffff0502dab258, 0)
>>   ffffff0023589ec0 pwrite64+0x244(16, b6f7c000, 800, a7ef7800, 0)
>>   ffffff0023589f10 sys_syscall32+0xff()
>>
>>> ffffff0514395018::print dmu_tx_t
>> {
>>     tx_holds = {
>>         list_size = 0x50
>>         list_offset = 0x8
>>         list_head = {
>>             list_next = 0xffffff0508054840
>>             list_prev = 0xffffff050da3b1f8
>>         }
>>     }
>>     tx_objset = 0xffffff05028c8940
>>     tx_dir = 0xffffff04e7785400
>>     tx_pool = 0xffffff0502ceac00
>>     tx_txg = 0xf36
>>     tx_lastsnap_txg = 0x1
>>     tx_lasttried_txg = 0
>>     tx_txgh = {
>>         th_cpu = 0xffffff051fcf0340
>>         th_txg = 0xf36
>>     }
>>     tx_tempreserve_cookie = 0
>>     tx_needassign_txh = 0
>>     tx_callbacks = {
>>         list_size = 0x20
>>         list_offset = 0
>>         list_head = {
>>             list_next = 0xffffff0514395098
>>             list_prev = 0xffffff0514395098
>>         }
>>     }
>>     tx_anyobj = 0
>>     tx_err = 0
>> }
>>
>>> ffffff05083403c8::mutex
>>             ADDR  TYPE             HELD MINSPL OLDSPL WAITERS
>> ffffff05083403c8 adapt ffffff002035cc40      -      -     yes
>>
>>> ffffff002035cc40::findstack -v
>> stack pointer for thread ffffff002035cc40: ffffff002035c590
>> [ ffffff002035c590 _resume_from_idle+0xf1() ]
>>   ffffff002035c5c0 swtch+0x145()
>>   ffffff002035c5f0 cv_wait+0x61(ffffff05123ce350, ffffff05123ce348)
>>   ffffff002035c630 zio_wait+0x5d(ffffff05123ce048)
>>   ffffff002035c690 dbuf_read+0x1e8(ffffff0509c758e0, 0, a)
>>   ffffff002035c710 dmu_buf_hold+0xac(ffffff05028c8940, ffffffffffffffff, 0, 0, ffffff002035c748, 1)
>>   ffffff002035c7b0 zap_lockdir+0x6d(ffffff05028c8940, ffffffffffffffff, 0, 1, 1, 0, ffffff002035c7d8)
>>   ffffff002035c840 zap_lookup_norm+0x55(ffffff05028c8940, ffffffffffffffff, ffffff002035c920, 8, 1, ffffff002035c8b8, 0, 0, 0, 0)
>>   ffffff002035c8a0 zap_lookup+0x2d(ffffff05028c8940, ffffffffffffffff, ffffff002035c920, 8, 1, ffffff002035c8b8)
>>   ffffff002035c910 zap_increment+0x64(ffffff05028c8940, ffffffffffffffff, ffffff002035c920, fffffffeffef7e00, ffffff0511d9bc80)
>>   ffffff002035c990 zap_increment_int+0x68(ffffff05028c8940, ffffffffffffffff, 0, fffffffeffef7e00, ffffff0511d9bc80)
>>   ffffff002035c9f0 do_userquota_update+0x69(ffffff05028c8940, 100108000, 3, 0, 0, 1, ffffff0511d9bc80)
>>   ffffff002035ca50 dmu_objset_do_userquota_updates+0xde(ffffff05028c8940, ffffff0511d9bc80)
>>   ffffff002035cad0 dsl_pool_sync+0x112(ffffff0502ceac00, f34)
>>   ffffff002035cb80 spa_sync+0x37b(ffffff0501269580, f34)
>>   ffffff002035cc20 txg_sync_thread+0x247(ffffff0502ceac00)
>>   ffffff002035cc30 thread_start+8()
>>
>>> ffffff05123ce048::zio -r
>> ADDRESS          TYPE  STAGE            WAITER
>> ffffff05123ce048 NULL  CHECKSUM_VERIFY  ffffff002035cc40
>>  ffffff051a9a9338 READ  VDEV_IO_START   -
>>  ffffff050e3a4050 READ  VDEV_IO_DONE    -
>>  ffffff0519173c90 READ  VDEV_IO_START   -
>>
>>> ffffff0519173c90::print zio_t io_done
>> io_done = vdev_cache_fill
>>
>> The zio ffffff0519173c90 is a vdev cache read request that never
>> completes, so txg_sync_thread is blocked. I don't know why this zio
>> cannot be satisfied and reach the DONE stage.
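For anyone who ends up with the crash dump, a couple of mdb lines that should show where that read is parked; the field names match the zio_t of this era, and <io_vd> is a placeholder for the pointer printed by the first command, so treat this as a sketch rather than exact syntax for every build:

    > ffffff0519173c90::print zio_t io_stage io_executor io_vd io_offset io_size
    > <io_vd>::print vdev_t vdev_path vdev_cache

The first line shows which pipeline stage the request is sitting in and which vdev it targets; the second resolves that vdev to a device path and dumps its vdev cache state.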
>> While zfs was hung I also tried dd against the raw device that backs
>> the pool, and that worked fine.
>>
>> Thanks
>> Zhihui
>>
>> On Mon, Jul 5, 2010 at 7:56 PM, zhihui Chen <zhch...@gmail.com> wrote:
>>>
>>> I tried to run "zfs list" on my system, but it looks like the command
>>> hangs. It does not return even if I press Ctrl+C:
>>>
>>> r...@intel7:/export/bench/io/filebench/results# zfs list
>>> ^C^C^C^C
>>> ^C^C^C^C
>>> ..
>>>
>>> When this happens, I am running the filebench benchmark with the oltp
>>> workload. But "zpool status" shows that all pools are in good status:
>>>
>>> r...@intel7:~# zpool status
>>>   pool: rpool
>>>  state: ONLINE
>>> status: The pool is formatted using an older on-disk format. The pool can
>>>         still be used, but some features are unavailable.
>>> action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
>>>         pool will no longer be accessible on older software versions.
>>>   scan: none requested
>>> config:
>>>
>>>         NAME        STATE     READ WRITE CKSUM
>>>         rpool       ONLINE       0     0     0
>>>           c8t0d0s0  ONLINE       0     0     0
>>>
>>> errors: No known data errors
>>>
>>>   pool: tpool
>>>  state: ONLINE
>>>   scan: none requested
>>> config:
>>>
>>>         NAME        STATE     READ WRITE CKSUM
>>>         tpool       ONLINE       0     0     0
>>>           c10t1d0   ONLINE       0     0     0
>>>
>>> errors: No known data errors
>>>
>>> My system is running B141, and tpool is using the latest pool version, 26.
>>> I tried "truss -p `pgrep zfs`", but it fails:
>>>
>>> r...@intel7:~# truss -p `pgrep zfs`
>>> truss: unanticipated system error: 5060
>>>
>>> It looks like zfs is in a deadlocked state, but I don't know the cause.
>>> I have run the filebench/oltp workload several times, and each time it
>>> leads to this state. But if I run filebench with other workloads such as
>>> fileserver or webserver, this issue does not happen.
>>>
>>> Thanks
>>> Zhihui
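To close the loop on the workaround mentioned at the top of the thread: this is roughly how zfs_vdev_max_pending gets changed, in case anyone wants to reproduce the experiment (standard ZFS tuning syntax; verify the symbol exists on your build first, e.g. with echo 'zfs_vdev_max_pending::print' | mdb -k):

    # Persistent across reboots: add to /etc/system
    set zfs:zfs_vdev_max_pending = 35

    # Live change on the running kernel; 0t35 is decimal 35 in mdb notation
    echo 'zfs_vdev_max_pending/W 0t35' | mdb -kw

With the default of 10 the hang reproduces easily here; at 35 it did not reproduce in several runs.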