So I don't know if I'm on the right track, but I've been looking at the ::threadlist and ::findstack output from above, specifically this thread, which seems to be what zpool-syspool is stuck on:
> 0xffffff000fa05c60::findstack -v
stack pointer for thread ffffff000fa05c60: ffffff000fa05860
[ ffffff000fa05860 _resume_from_idle+0xf1() ]
  ffffff000fa05890 swtch+0x145()
  ffffff000fa058c0 cv_wait+0x61(ffffff000fa05e3e, ffffff000fa05e40)
  ffffff000fa05900 delay_common+0xab(1)
  ffffff000fa05940 delay+0xc4(1)
  ffffff000fa05960 dnode_special_close+0x28(ffffff02e8aa2050)
  ffffff000fa05990 dmu_objset_evict+0x160(ffffff02e5b91100)
  ffffff000fa05a20 dsl_dataset_user_release_sync+0x52(ffffff02e000b928, ffffff02d0b1a868, ffffff02e5b9c6e0)
  ffffff000fa05a70 dsl_sync_task_group_sync+0xf3(ffffff02d0b1a868, ffffff02e5b9c6e0)
  ffffff000fa05af0 dsl_pool_sync+0x1ec(ffffff02cd540380, 9291)
  ffffff000fa05ba0 spa_sync+0x37b(ffffff02cdd40b00, 9291)
  ffffff000fa05c40 txg_sync_thread+0x247(ffffff02cd540380)
  ffffff000fa05c50 thread_start+8()

It seems to be trying to sync txg 9291. dmu_objset_evict() is called with:

> ffffff02e5b91100::print objset_t
{
    os_dsl_dataset = 0xffffff02cd748880
    os_spa = 0xffffff02cdd40b00
    os_phys_buf = 0xffffff02cf426a88
    os_phys = 0xffffff02e570d800
    os_meta_dnode = 0xffffff02e8aa2050
    os_userused_dnode = 0xffffff02e8aa1758
    os_groupused_dnode = 0xffffff02e8aa1478

and the ds_snapname indeed matches the name of the snapshot I was copying with zfs send when the system hung:

> ffffff02cd748880::print dsl_dataset_t ds_snapname
ds_snapname = [ "20100824" ]

Within dmu_objset_evict() it executes:

        /*
         * We should need only a single pass over the dnode list, since
         * nothing can be added to the list at this point.
         */
        (void) dmu_objset_evict_dbufs(os);

        dnode_special_close(os->os_meta_dnode);
        if (os->os_userused_dnode) {
                dnode_special_close(os->os_userused_dnode);
                dnode_special_close(os->os_groupused_dnode);

and in the stack it's calling dnode_special_close+0x28(ffffff02e8aa2050), which matches the value of os->os_meta_dnode. So I guess that means it's stuck in dnode_special_close(), handling the os_meta_dnode?

Looking at the code for dnode_special_close() in dnode.c seems to explain the delay+0xc4(1) in the stack. The meta dnode still has holds on it:

> ffffff02e8aa2050::print dnode_t dn_holds
dn_holds = {
    dn_holds.rc_count = 0x20
}

so with 0x20 = 32 outstanding holds, this loop just keeps delaying:

void
dnode_special_close(dnode_t *dn)
{
        /*
         * Wait for final references to the dnode to clear.  This can
         * only happen if the arc is asyncronously evicting state that
         * has a hold on this dnode while we are trying to evict this
         * dnode.
         */
        while (refcount_count(&dn->dn_holds) > 0)
                delay(1);
        dnode_destroy(dn);
}

But now I'm reaching the limit of what I'm able to debug, as my understanding of the inner workings of ZFS is very limited. Any thoughts or suggestions based on this analysis?

At least I've learned quite a bit about mdb. :)
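P.S. To convince myself that this wait loop really is the failure mode, I put together a tiny userland model of it. This is not ZFS code, and all the names (holdwait.c, fake_dnode_t, closer, holder) are made up for illustration; it just mimics the pattern in dnode_special_close(): one thread polls a hold count in a sleep loop, the thread standing in for the hold owner only ever releases half of the 0x20 holds, and so the poller never gets out, which is where the txg sync thread appears to be sitting in the real hang.

/*
 * NOT ZFS code: a toy userland model of the wait loop in
 * dnode_special_close().  A "closer" thread polls a hold count in a
 * sleep loop (like delay(1) in the kernel), while a "holder" thread
 * stands in for whatever owns the holds but only ever drops half of
 * them, so the closer waits forever.  In the real hang the whole
 * txg sync thread is stuck behind such a loop.
 *
 * Build with:  gcc -std=c11 -o holdwait holdwait.c -lpthread
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

typedef struct fake_dnode {             /* stand-in for dnode_t */
        atomic_int holds;               /* stand-in for dn_holds.rc_count */
} fake_dnode_t;

static fake_dnode_t dn = { 32 };        /* 0x20, as in the ::print output */

static void *
closer(void *arg)                       /* models dnode_special_close() */
{
        while (atomic_load(&dn.holds) > 0)
                usleep(10000);          /* models delay(1): roughly one tick */
        printf("holds reached zero, dnode destroyed\n");
        return (NULL);
}

static void *
holder(void *arg)                       /* models the hold owner */
{
        for (int i = 0; i < 16; i++) {  /* releases only 16 of the 32 holds */
                atomic_fetch_sub(&dn.holds, 1);
                usleep(1000);
        }
        return (NULL);
}

int
main(void)
{
        pthread_t c, h;

        pthread_create(&c, NULL, closer, NULL);
        pthread_create(&h, NULL, holder, NULL);
        pthread_join(h, NULL);

        sleep(1);                       /* give the closer time to notice */
        printf("closer still spinning, holds = %d\n",
            atomic_load(&dn.holds));
        return (0);                     /* the real txg sync has no way out */
}

The toy obviously just restates the symptom. The real question remains what is holding those 0x20 references on os_meta_dnode and why they are never dropped (the source comment points at asynchronous ARC eviction), and that part I don't know how to chase further.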