Hello Robert,

Wednesday, August 2, 2006, 12:22:11 AM, you wrote:

RM> Hello Neil,

RM> Tuesday, August 1, 2006, 8:45:02 PM, you wrote:


NP>> Robert Milkowski wrote On 08/01/06 11:41,:
>>> Hello Robert,
>>> 
>>> Monday, July 31, 2006, 12:48:30 AM, you wrote:
>>> 
>>> RM> Hello ZFS,
>>> 
>>> RM>    System was rebooted and after reboot server again
>>> 
>>> RM> System is snv_39, SPARC, T2000
>>> 
>>> RM> bash-3.00# ptree
>>> RM> 7     /lib/svc/bin/svc.startd -s
>>> RM>   163   /sbin/sh /lib/svc/method/fs-local
>>> RM>     254   /usr/sbin/zfs mount -a
>>> RM> [...]
>>> 
>>> RM> bash-3.00# zfs list|wc -l
>>> RM>       46
>>> 
>>> RM> Using df I can see most file systems are already mounted.
>>> 
>>> RM>       
>>> 
>>> RM> > ::ps!grep zfs
>>> 
>>> RM> R    254    163      7      7      0 0x4a004000 00000600219e1800 zfs
>>> 
>>> RM> > 00000600219e1800::walk thread|::findstack -v
>>> 
>>> RM> stack pointer for thread 300013026a0: 2a10069ebb1
>>> RM> [ 000002a10069ebb1 cv_wait+0x40() ]
>>> RM>   000002a10069ec61 txg_wait_synced+0x54(3000052f0d0, 2f2216d, 3000107fb90, 3000052f110, 3000052f112, 3000052f0c8)
>>> RM>   000002a10069ed11 zil_replay+0xbc(60022609c08, 600226a1840, 600226a1870, 700d13d8, 7ba2d000, 60006821200)
>>> RM>   000002a10069ee01 zfs_domount+0x1f0(7ba35000, 700d1000, 60021a8bae0, 1, 0, 0)
>>> RM>   000002a10069eed1 zfs_mount+0x160(600225f6080, 1868800, 2a10069f9d8, 60021a8bae0, 2f, 0)
>>> RM>   000002a10069efb1 domount+0x9b0(2a10069f8b0, 1, 600225f6080, 0, 0, 0)
>>> RM>   000002a10069f121 mount+0x110(600013fbcf8, 2a10069fad8, 0, 0, ff38e40c, 100)
>>> RM>   000002a10069f221 syscall_ap+0x44(2a0, ffbfeca0, 1118bf4, 600013fbc40, 15, 0)
>>> RM>   000002a10069f2e1 syscall_trap32+0xcc(52b3c8, ffbfeca0, 100, ff38e40c, 0, 0)
>>> 
>>> 
>>> 
>>> 
>>> RM> # iostat -xnz 1
>>> RM>                     extended device statistics
>>> RM>     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
>>> RM>     3.0  513.0  192.0 2822.1  0.0  2.1    0.0    4.0   0  95 c4t600C0FF00000000009258F3E4C4C5601d0
>>> RM>                     extended device statistics
>>> RM>     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
>>> RM>     7.0  598.1  388.6 1832.9  0.0  2.0    0.0    3.4   0  93 c4t600C0FF00000000009258F3E4C4C5601d0
>>> RM> ^C
>>> RM> bash-3.00#
>>> 
>>> RM> However, for many seconds at a time no I/Os are issued at all.
>>> 
>>> RM> # mpstat 1
>>> RM> [...]
>>> 
>>> RM> CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
>>> RM>   0    0   0   74   209  109    0    0    0    0    0     0    0   1   0  99
>>> RM>   1    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
>>> RM>   2    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
>>> RM>   3    0   0    0     1    0    0    0    0    0    0     0    0 100   0   0
>>> RM>   4    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
>>> RM>   5    0   0    0     6    1    8    0    0    0    0     0    0   0   0 100
>>> RM>   6    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
>>> RM>   7    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
>>> RM>   8    3   0    1     7    0   12    0    0    2    0   245    1   1   0  98
>>> RM>   9    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
>>> RM>  10    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
>>> RM>  11    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
>>> RM>  12    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
>>> RM>  13    0   0    0     6    0   10    0    0    0    0     0    0   0   0 100
>>> RM>  14    0   0    0     3    0    2    1    0    0    0   188    0   1   0  99
>>> RM>  15    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
>>> RM>  16    0   0    0     8    1   12    0    0    0    0     0    0   0   0 100
>>> RM>  17    0   0    0     7    0   12    0    0    0    0     0    0   0   0 100
>>> RM>  18    0   0    0    23    0   43    0    0    1    0     0    0   0   0 100
>>> RM>  19    0   0    0     9    0   16    0    0    0    0     0    0   0   0 100
>>> RM>  20    0   0    0     8    0   14    0    0    0    0     0    0   0   0 100
>>> RM>  21    0   0    0     7    2    8    0    1    0    0     1    0   1   0  99
>>> RM>  22    0   0   17    38   34    7    0    1    1    0     5    0   0   0 100
>>> RM>  23    0   0    0     7    0   12    0    0    0    0     0    0   0   0 100
>>> RM>  24    0   0    0     4    0    6    0    1    0    0     0    0   0   0 100
>>> RM>  25    0   0    1     5    0    8    0    1    0    0     0    0   0   0 100
>>> RM>  26    0   0    0     2    0    2    0    0    0    0     0    0   0   0 100
>>> RM>  27    0   0    0     2    0    2    0    0    0    0     0    0   0   0 100
>>> RM>  28    0   0    0     2    0    2    0    0    1    0     0    0   0   0 100
>>> RM>  29    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
>>> RM>  30    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
>>> RM>  31    0   0    0     6    0    8    0    0    0    0     0    0   0   0 100
>>> RM> ^C
>>> RM> bash-3.00#
>>> 
>>> RM> # dtrace -n fbt:::entry'{self->t=timestamp;}' \
>>> RM>     -n fbt:::return'/self->t/{@[probefunc]=sum(timestamp-self->t);self->t=0;}' \
>>> RM>     -n tick-5s'{printa(@);exit(0);}'
>>> RM> [...]
>>> RM>   syscall_mstate                                             12636328
>>> RM>   callout_schedule_1                                         14428108
>>> RM>   hwblkclr                                                   16656308
>>> RM>   avl_rotation                                               17196252
>>> RM>   page_pptonum                                               19457456
>>> RM>   sfmmu_mlist_enter                                          20078508
>>> RM>   sfmmu_mlist_exit                                           20804176
>>> RM>   page_numtomemseg_nolock                                    23377972
>>> RM>   avl_insert                                                 23727512
>>> RM>   avl_insert_here                                            25281752
>>> RM>   sfmmu_mlspl_enter                                          32277524
>>> RM>   kmem_cache_alloc                                           38497240
>>> RM>   xc_serv                                                    94880184
>>> RM>   disp_getwork                                              118366960
>>> RM>   disp_anywork                                              191004096
>>> RM>   space_map_seg_compare                                     573117700
>>> RM>   resume                                                   7059674296
>>> 
>>> RM> #
>>> 
>>> RM> bash-3.00# dtrace -n fbt::space_map_seg_compare:entry'{@[stack()]=count();}'
>>> RM> dtrace: description 'fbt::space_map_seg_compare:entry' matched 1 probe
>>> RM> ^C
>>> 
>>> 
>>> RM>               genunix`avl_find+0x38
>>> RM>               zfs`space_map_add+0x12c
>>> RM>               zfs`space_map_load+0x214
>>> RM>               zfs`metaslab_activate+0x3c
>>> RM>               zfs`metaslab_group_alloc+0x1b0
>>> RM>               zfs`metaslab_alloc_dva+0x10c
>>> RM>               zfs`metaslab_alloc+0x2c
>>> RM>               zfs`zio_dva_allocate+0x50
>>> RM>               zfs`zio_write_compress+0x1e4
>>> RM>               zfs`arc_write+0xbc
>>> RM>               zfs`dbuf_sync+0x6b0
>>> RM>               zfs`dnode_sync+0x300
>>> RM>               zfs`dmu_objset_sync_dnodes+0x68
>>> RM>               zfs`dmu_objset_sync+0x50
>>> RM>               zfs`dsl_dataset_sync+0xc
>>> RM>               zfs`dsl_pool_sync+0x60
>>> RM>               zfs`spa_sync+0xe0
>>> RM>               zfs`txg_sync_thread+0x130
>>> RM>               unix`thread_start+0x4
>>> RM>            316769
>>> 
>>> RM>               genunix`avl_find+0x38
>>> RM>               zfs`space_map_remove+0x98
>>> RM>               zfs`space_map_load+0x214
>>> RM>               zfs`metaslab_activate+0x3c
>>> RM>               zfs`metaslab_group_alloc+0x1b0
>>> RM>               zfs`metaslab_alloc_dva+0x10c
>>> RM>               zfs`metaslab_alloc+0x2c
>>> RM>               zfs`zio_dva_allocate+0x50
>>> RM>               zfs`zio_write_compress+0x1e4
>>> RM>               zfs`arc_write+0xbc
>>> RM>               zfs`dbuf_sync+0x6b0
>>> RM>               zfs`dnode_sync+0x300
>>> RM>               zfs`dmu_objset_sync_dnodes+0x68
>>> RM>               zfs`dmu_objset_sync+0x50
>>> RM>               zfs`dsl_dataset_sync+0xc
>>> RM>               zfs`dsl_pool_sync+0x60
>>> RM>               zfs`spa_sync+0xe0
>>> RM>               zfs`txg_sync_thread+0x130
>>> RM>               unix`thread_start+0x4
>>> RM>            404210
>>> RM> bash-3.00#
>>> 
>>> RM> The above was running for 3-4 seconds.
>>> 
>>> 
>>> 
>>> RM> I can see some progress - another file system gets mounted every
>>> RM> 3-10 minutes.
>>> 
>>> RM> However, overall it can take over an hour to boot the server before
>>> RM> zfs mount -a has actually mounted all the file systems. Also, with the
>>> RM> default Solaris configuration you can't log in during this stage,
>>> RM> either on the console or via ssh, and no messages are issued. That is
>>> RM> bad too, as you have no idea what is going on. We had to reboot again,
>>> RM> boot to single user, start sshd, log in via ssh and let the boot
>>> RM> process proceed in order to see why the server wouldn't boot.
>>> 
>>> RM> According to zpool status all pools are ok and no errors at all are reported.
>>> 
>>> 
>>> RM> Summary:
>>> 
>>> RM>   1. zfs mount -a shouldn't "sleep" for an hour!
>>> RM>   2. One virtual CPU core of a 1GHz T2000 is consumed for most of the
>>> RM>      time during zfs mount -a - why is it so CPU intensive? On this
>>> RM>      server the CPU became the bottleneck, to the point that usually
>>> RM>      no I/Os were issued to the disks during the mount. When the same
>>> RM>      thing happened on a faster CPU the disks were doing 100MB/s, but
>>> RM>      it still took 10-20 minutes for zfs to mount all file systems.
>>> RM>   3. I thought the problem of zfs mount taking so long before it
>>> RM>      returns had already been addressed, so either this is something
>>> RM>      completely new or something is still wrong. I'm thinking about
>>> RM>      http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6377670
>>> RM>   4. It would be better if zfs mount -a mounted file systems in
>>> RM>      parallel, as it looks like it mounts them serially right now.
>>> 
>>> RM> The above behavior is really BAD and it makes ZFS look like it has
>>> RM> fsck built in - I know it doesn't.
>>> RM>   
>>> 
>>> 
>>> 
>>> Any ideas?

NP>> It looks to me like the mount code, before replaying the log, flushes all
NP>> transactions in the group (txg_wait_synced()) to get some consistency.
NP>> This then takes a very long time - searching the space map, perhaps
NP>> because it's nearly full. Is it?

RM> More than 200GB free in a pool of about 4.5TB.

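Just to help narrow that down, a rough DTrace sketch like the one below
(untested, using the space_map_load function that shows up in the stacks
above, and assuming fbt entry/return probes exist for it) should show how
long each space map load takes while this is going on:

# dtrace -n '
  fbt::space_map_load:entry { self->t = timestamp; }
  fbt::space_map_load:return
  /self->t/
  {
          @["space_map_load (ns)"] = quantize(timestamp - self->t);
          self->t = 0;
  }'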

I ran 'zfs set quota=400g nfs-s5-s8/d5201 nfs-s5-s8/d5202
nfs-s5-s8/d5203 nfs-s5-s8/d5' and it's still running after 30 minutes.

# zpool iostat nfs-s5-s8 1
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
nfs-s5-s8   4.24T   302G    243     25  13.1M   223K
nfs-s5-s8   4.24T   302G    409      0  15.0M      0
nfs-s5-s8   4.24T   302G    162      0  8.95M      0
nfs-s5-s8   4.24T   302G    317      0  18.9M      0
nfs-s5-s8   4.24T   302G    298     30  24.0M  90.0K
nfs-s5-s8   4.24T   302G    374      0  26.9M      0


# mdb -kw
Loading modules: [ unix krtld genunix specfs dtrace ufs sd pcisch md ip sctp 
usba lofs logindmux ptm qlc fctl cpc fcp fcip random crypto zfs ssd nfs ]
> ::ps!grep zfs
R  23426   2460  23426   2454      0 0x4a004000 0000060002a66790 zfs
> 0000060002a66790::walk thread|::findstack -v
stack pointer for thread 3006703b320: 2a10af5ccb1
[ 000002a10af5ccb1 cv_wait+0x40() ]
  000002a10af5cd61 txg_wait_synced+0x54(30055528990, 2f23aa5, 2f23aa3, 300555289d0, 300555289d2, 30055528988)
  000002a10af5ce11 dsl_sync_task_group_wait+0x120(300c0860360, 0, 2f23aa5, 71, 7b64ecf8, 300c0860388)
  000002a10af5cec1 dsl_sync_task_do+0x28(300c0860360, 7b611dcc, 7b611e78, 60009e5e380, 2a10af5d828, 0)
  000002a10af5cf71 dsl_dir_set_quota+0x64(7b648400, 6400000000, 7b611c00, 7b611c00, 60009e5e380, 0)
  000002a10af5d031 zfsdev_ioctl+0x158(701d8c00, 3f, ffbfec18, 15, 54, 600027aa000)
  000002a10af5d0e1 fop_ioctl+0x20(600026dd280, 5a15, ffbfec18, 100003, 6001cacde58, 120e894)
  000002a10af5d191 ioctl+0x184(3, 60011c64ab8, ffbfec18, b614c, 1010101, 5a15)
  000002a10af5d2e1 syscall_trap32+0xcc(3, 5a15, ffbfec18, b614c, 1010101, fffffd74)

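So the quota change is just sitting in txg_wait_synced(), waiting for the
sync thread to finish the txg. A quick sketch like this (untested,
filtering on the zfs command) should show how long each of those waits
actually is:

# dtrace -n '
  fbt::txg_wait_synced:entry /execname == "zfs"/ { self->t = timestamp; }
  fbt::txg_wait_synced:return
  /self->t/
  {
          @["txg_wait_synced (ms)"] = quantize((timestamp - self->t) / 1000000);
          self->t = 0;
  }'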

Setting a quota on file systems in other pools takes something like 10s,
which is still quite long, but not 30 minutes (or more)! The only
difference I see is that nfs-s5-s8 has less free space than the other
pools.

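To check whether this is the same space map search that was hot during
zfs mount -a, something along these lines (again just an untested sketch)
could be run while the zfs set quota command is stuck:

# dtrace -n 'fbt::space_map_seg_compare:entry { @[stack(5)] = count(); }
  tick-10s { trunc(@, 5); exit(0); }'
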
This is snv_43, as I reimported the pool on another server.


Something is terribly wrong here.

-- 
Best regards,
 Robert                            mailto:[EMAIL PROTECTED]
                                       http://milek.blogspot.com

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
