Hi all,

I have been testing QJM HA and it has always worked well. Yesterday, however, I ran into a very long failover with QJM. The test is based on CDH4.3.0. The logs of the standby namenode and the journalnode are attached.

The network cable on the active namenode (which is also a datanode) was pulled out at about 07:24. In the standby namenode log I found entries like these:

2013-08-28 07:24:51,122 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 1 Total time for transactions(ms): 1Number of transactions batched in Syncs: 0 Number of syncs: 0 SyncTimes(ms): 0 41 42
2013-08-28 07:36:14,028 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 32 Total time for transactions(ms): 3Number of transactions batched in Syncs: 0 Number of syncs: 1 SyncTimes(ms): 9 49 46
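
To put an exact number on the silence between two such entries, the timestamps can be subtracted directly. The following is only a rough sketch: it assumes the default log4j timestamp layout shown above (date, time, comma, milliseconds in the first 23 characters), and the helper names parse_ts and gap_seconds are made up for illustration.

```python
from datetime import datetime

# Assumed layout of the leading timestamp, e.g. "2013-08-28 07:24:51,122".
LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"
TS_LENGTH = 23  # characters occupied by the timestamp at the start of a line


def parse_ts(line):
    """Parse the timestamp at the start of a log line (hypothetical helper)."""
    return datetime.strptime(line[:TS_LENGTH], LOG_TS_FORMAT)


def gap_seconds(earlier_line, later_line):
    """Return the time between two log lines in seconds (hypothetical helper)."""
    return (parse_ts(later_line) - parse_ts(earlier_line)).total_seconds()


first = "2013-08-28 07:24:51,122 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: ..."
second = "2013-08-28 07:36:14,028 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: ..."
print(gap_seconds(first, second))
```

For the two entries quoted here the gap comes out at roughly 683 seconds, i.e. a little over 11 minutes.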

The entries themselves look normal. The problem is that nothing at all was logged between these two lines, a gap of more than 11 minutes (07:24:51 to 07:36:14). No long GC pause occurred during that time, so it seems the code was blocked somewhere. Unfortunately, I forgot to capture a jstack dump T_T.

I hope to hear from you.

Best regards,
Mickey