So I don't know if I'm on the right track, but I've been looking at the ::threadlist and ::findstack output from above, specifically this thread, which seems to be what zpool-syspool is stuck on:
> 0xffffff000fa05c60::findstack -v
stack pointer for thread ffffff000fa05c60: ffffff000fa05860
[ ffffff000fa05860 _resume_from_idle+0xf1() ]
  ffffff000fa05890 swtch+0x145()
  ffffff000fa058c0 cv_wait+0x61(ffffff000fa05e3e, ffffff000fa05e40)
  ffffff000fa05900 delay_common+0xab(1)
  ffffff000fa05940 delay+0xc4(1)
  ffffff000fa05960 dnode_special_close+0x28(ffffff02e8aa2050)
  ffffff000fa05990 dmu_objset_evict+0x160(ffffff02e5b91100)
  ffffff000fa05a20 dsl_dataset_user_release_sync+0x52(ffffff02e000b928, ffffff02d0b1a868, ffffff02e5b9c6e0)
  ffffff000fa05a70 dsl_sync_task_group_sync+0xf3(ffffff02d0b1a868, ffffff02e5b9c6e0)
  ffffff000fa05af0 dsl_pool_sync+0x1ec(ffffff02cd540380, 9291)
  ffffff000fa05ba0 spa_sync+0x37b(ffffff02cdd40b00, 9291)
  ffffff000fa05c40 txg_sync_thread+0x247(ffffff02cd540380)
  ffffff000fa05c50 thread_start+8()

It seems to be trying to sync txg 9291. dmu_objset_evict() is called with:

> ffffff02e5b91100::print objset_t
{
    os_dsl_dataset = 0xffffff02cd748880
    os_spa = 0xffffff02cdd40b00
    os_phys_buf = 0xffffff02cf426a88
    os_phys = 0xffffff02e570d800
    os_meta_dnode = 0xffffff02e8aa2050
    os_userused_dnode = 0xffffff02e8aa1758
    os_groupused_dnode = 0xffffff02e8aa1478

and the ds_snapname indeed matches the name of the snapshot I was copying with zfs send when the system hung:

> ffffff02cd748880::print dsl_dataset_t ds_snapname
ds_snapname = [ "20100824" ]

Within dmu_objset_evict() it executes:

        /*
         * We should need only a single pass over the dnode list, since
         * nothing can be added to the list at this point.
         */
        (void) dmu_objset_evict_dbufs(os);

        dnode_special_close(os->os_meta_dnode);
        if (os->os_userused_dnode) {
                dnode_special_close(os->os_userused_dnode);
                dnode_special_close(os->os_groupused_dnode);

and in the stack it's calling dnode_special_close+0x28(ffffff02e8aa2050), which matches the value of os->os_meta_dnode. So I guess that means it's stuck in dnode_special_close(), handling the os_meta_dnode?

Looking at the code for dnode_special_close() in dnode.c seems to explain the delay+0xc4(1) in the stack. The meta dnode still has holds on it:

> ffffff02e8aa2050::print dnode_t dn_holds
dn_holds = {
    dn_holds.rc_count = 0x20
}

so with 0x20 = 32 outstanding holds, this loop just keeps delaying:

void
dnode_special_close(dnode_t *dn)
{
        /*
         * Wait for final references to the dnode to clear.  This can
         * only happen if the arc is asyncronously evicting state that
         * has a hold on this dnode while we are trying to evict this
         * dnode.
         */
        while (refcount_count(&dn->dn_holds) > 0)
                delay(1);
        dnode_destroy(dn);
}

But now I'm reaching the limit of what I'm able to debug, as my understanding of the inner workings of ZFS is very limited. Any thoughts or suggestions based on this analysis?

At least I've learned quite a bit about mdb. :)
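P.S. To convince myself that this wait loop really is the failure mode, I put together a tiny userland model of it. This is not ZFS code, and all the names (holdwait.c, fake_dnode_t, closer, holder) are made up for illustration; it just mimics the pattern in dnode_special_close(): one thread polls a hold count in a sleep loop, the thread standing in for the hold owner only ever releases half of the 0x20 holds, and so the poller never gets out, which is where the txg sync thread appears to be sitting in the real hang.

/*
 * NOT ZFS code: a toy userland model of the wait loop in
 * dnode_special_close().  A "closer" thread polls a hold count in a
 * sleep loop (like delay(1) in the kernel), while a "holder" thread
 * stands in for whatever owns the holds but only ever drops half of
 * them, so the closer waits forever.  In the real hang the whole
 * txg sync thread is stuck behind such a loop.
 *
 * Build with:  gcc -std=c11 -o holdwait holdwait.c -lpthread
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

typedef struct fake_dnode {             /* stand-in for dnode_t */
        atomic_int holds;               /* stand-in for dn_holds.rc_count */
} fake_dnode_t;

static fake_dnode_t dn = { 32 };        /* 0x20, as in the ::print output */

static void *
closer(void *arg)                       /* models dnode_special_close() */
{
        while (atomic_load(&dn.holds) > 0)
                usleep(10000);          /* models delay(1): roughly one tick */
        printf("holds reached zero, dnode destroyed\n");
        return (NULL);
}

static void *
holder(void *arg)                       /* models the hold owner */
{
        for (int i = 0; i < 16; i++) {  /* releases only 16 of the 32 holds */
                atomic_fetch_sub(&dn.holds, 1);
                usleep(1000);
        }
        return (NULL);
}

int
main(void)
{
        pthread_t c, h;

        pthread_create(&c, NULL, closer, NULL);
        pthread_create(&h, NULL, holder, NULL);
        pthread_join(h, NULL);

        sleep(1);                       /* give the closer time to notice */
        printf("closer still spinning, holds = %d\n",
            atomic_load(&dn.holds));
        return (0);                     /* the real txg sync has no way out */
}

The toy obviously just restates the symptom. The real question remains what is holding those 0x20 references on os_meta_dnode and why they are never dropped (the source comment points at asynchronous ARC eviction), and that part I don't know how to chase further.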