Problem Abstract: TestCLI hangs Hadoop source code version: 1.0.3 Test Environment: RHEL 6.1 + IBM JAVA 7 JVM Development Environment RHEL 6.1 + IBM JAVA 7 SDK
cause of the hang = At the end of TestCLI and during tearDown() cleanup, hdfs shutdown is called. As the result of hdfs cluster shutdown, all data nodes are also subjected to shutdown. However during DataNode shutdown process, thread hangs on a call to ServerSocket.close() in DataXceiverServer>kill() method, thus AsynchronousCloseException never get generated that in turn doesn't terminate DataXceiverServer.run() thread, HANG. I have pasted a portion of testcase log file pointing at the location of the hang. Has anyone experienced a similar problem? Any suggestion on how to debug a serversocket hang ? LOG FILE: ===================================== Shutting down the Mini HDFS Cluster Shutting down DataNode 0 2012-10-08 09:58:11,347 INFO mortbay.log (Slf4jLog.java:info(67)) - Stopped SelectChannelConnector@localhost:0 2012-10-08 09:58:11,349 INFO ipc.Server (Server.java:stop(1589)) - Stopping server on 56425 2012-10-08 09:58:11,350 INFO ipc.Server (Server.java:run(1428)) - IPC Server handler 0 on 56425: exiting 2012-10-08 09:58:11,350 INFO ipc.Server (Server.java:run(1428)) - IPC Server handler 1 on 56425: exiting 2012-10-08 09:58:11,350 INFO ipc.Server (Server.java:run(1428)) - IPC Server handler 2 on 56425: exiting 2012-10-08 09:58:11,350 INFO ipc.Server (Server.java:run(466)) - Stopping IPC Server listener on 56425 2012-10-08 09:58:11,350 INFO metrics.RpcInstrumentation (RpcInstrumentation.java:shutdown(154)) - shut down before call to close socketserver /127.0.0.1:40462 bond to true <==================================================== print before call to ss.close() in DataXceiverServer>kill() <<<< HANG on ServerSocket.close() .. as the result AsynchronousCloseException never get generated ==> run() thread doesn't get terminated .. >>>>>>> 2012-10-08 09:58:11,351 INFO ipc.Server (Server.java:run(663)) - Stopping IPC Server Responder 2012-10-08 09:58:11,427 INFO datanode.DataNode (DataNode.java:run(1470)) - DatanodeRegistration(127.0.0.1:40462, storageID=DS-1568625081-127.0.0.1-40462-1349708246395, infoPort=48177, ipcPort=56425):Finishing DataNode in: FSDataset {dirpath='/home/sanjar/development/hadoop-common-1.0.3/build/test/data/dfs/data/data1/current,/home/sanjar/development/hadoop-common-1.0.3/build/test/data/dfs/data/data2/current'} 2012-10-08 09:58:11,427 WARN util.MBeans (MBeans.java:unregister(73)) - Hadoop:service=DataNode,name=DataNodeInfo javax.management.InstanceNotFoundException: Hadoop:service=DataNode,name=DataNodeInfo at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getMBean (DefaultMBeanServerInterceptor.java:1107) at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.exclusiveUnregisterMBean (DefaultMBeanServerInterceptor.java:439) at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.unregisterMBean (DefaultMBeanServerInterceptor.java:427) at com.sun.jmx.mbeanserver.JmxMBeanServer.unregisterMBean (JmxMBeanServer.java:548) at org.apache.hadoop.metrics2.util.MBeans.unregister(MBeans.java:71) at org.apache.hadoop.hdfs.server.datanode.DataNode.unRegisterMXBean (DataNode.java:522) at org.apache.hadoop.hdfs.server.datanode.DataNode.shutdown (DataNode.java:737) at org.apache.hadoop.hdfs.server.datanode.DataNode.run (DataNode.java:1471) at java.lang.Thread.run(Thread.java:779) 2012-10-08 09:58:11,428 INFO ipc.Server (Server.java:stop(1589)) - Stopping server on 56425 2012-10-08 09:58:11,428 INFO metrics.RpcInstrumentation (RpcInstrumentation.java:shutdown(154)) - shut down before call to close socketserver /127.0.0.1:40462 bond to true 2012-10-08 09:58:11,450 INFO datanode.DataBlockScanner (DataBlockScanner.java:run(630)) - Exiting DataBlockScanner thread. 2012-10-08 09:58:13,271 INFO namenode.FSNamesystem (QueueProcessingStatistics.java:logEndFirstCycle(221)) - ReplicateQueue QueueProcessingStatistics: First cycle completed 0 blocks in 0 msec 2012-10-08 09:58:13,273 INFO namenode.FSNamesystem (QueueProcessingStatistics.java:logEndLastCycle(234)) - ReplicateQueue QueueProcessingStatistics: Queue flush completed 0 blocks in 0 msec processing time, 0 msec clock time, 1 cycles 2012-10-08 09:58:13,279 INFO hdfs.StateChange (FSNamesystem.java:invalidateWorkForOneNode(3199)) - BLOCK* ask 127.0.0.1:40462 to delete blk_3105753263246905134_1138 blk_7178210216376038821_1035 blk_802519068887323225_1133 blk_-8493096021534159067_1149 blk_7602176121734813288_1171 blk_8668404524952190956_1136 blk_-6042918812810401975_1165 blk_5676309528403925090_1040 blk_1765642037028361222_1129 blk_8092034956650457458_1050 blk_-6648681910233280482_1134 blk_-1683744535295718305_1151 blk_3218539468768911033_1156 blk_5577719696046424228_1158 blk_-6520923586836362454_1172 blk_-6113270872737030099_1041 blk_5341998048760012359_1135 blk_-6969931240772459560_1121 blk_8394808289263836423_1045 blk_8539484182694882993_1146 blk_7671681651891508064_1126 blk_-2776317340570262361_1177 blk_2447083816854452599_1043 blk_-7481685483228902398_1196 blk_-6815661701610337288_1191 blk_3688560883394255137_1176 blk_1263533575967051168_1161 blk_-2692509525469741843_1130 blk_-2920344524512388710_1038 blk_2504931362409054454_1053 blk_-6706249849783565708_1132 blk_5395595878010222434_1194 blk_-4759132823454851626_1139 blk_-7607754639974630346_1140 blk_4506336083156819081_1141 blk_-2008642227253224114_1052 blk_3078592410791570491_1037 blk_9153438704688011593_1153 blk_-7131254408759933522_1122 blk_-2893067452191269835_1175 blk_939939854824782723_1174 blk_-8652591700089908126_1047 blk_-7991126287748428882_1162 blk_-6534381843790270462_1051 blk_-2480347688448975821_1167 blk_1836999865251288060_1192 blk_4837836637390714385_1124 blk_6419085183537620522_1178 blk_-1635920550266181839_1159 blk_1578539823433115335_1157 blk_1662403638206376616_1166 blk_-2376949396131258885_1046 blk_-7492133789334675866_1128 blk_-279683641357271853_1131 blk_1266005722495348888_1164 blk_1211794746875885582_1155 blk_6771999670299016507_1193 blk_-873039704108625303_1148 blk_2593251765103573846_1123 blk_-7083758500449179328_1125 blk_6057571526233026651_1127 blk_-7678793224528245937_1142 blk_5710043408364906826_1048 blk_-5745169747478406404_1195 blk_-688194968847186525_1160 blk_-8583749605651370300_1042 blk_1564352336476409053_1152 blk_-4372904772363416781_1163 blk_4037870233963603693_1147 blk_-6920818873832383331_1150 blk_-25490222785180228_1173 blk_7803320072955284668_1137 blk_1560055927983452575_1036 blk_7506121232707226228_1154 blk_6306975229951008004_1001 2012-10-08 09:58:13,280 INFO namenode.FSNamesystem (QueueProcessingStatistics.java:logEndFirstCycle(221)) - InvalidateQueue QueueProcessingStatistics: First cycle completed 75 blocks in 6 msec 2012-10-08 09:58:13,281 INFO namenode.FSNamesystem (QueueProcessingStatistics.java:logEndLastCycle(234)) - InvalidateQueue QueueProcessingStatistics: Queue flush completed 75 blocks in 6 msec processing time, 6 msec clock time, 1 cycles Best Regards Amir Sanjar System Management Architect PowerLinux Open Source Hadoop development lead IBM Senior Software Engineer Phone# 512-286-8393 Fax# 512-838-8858