I found the cause of the problem: I had been using iptables to simulate a crash of the active namenode machine. When I instead shut down the active namenode machine manually, hbase was able to write soon after the hadoop QJM failover. I still do not know the underlying reason. Thank you again for your reply!
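As a rough cross-check against the master log below: the policy string RetryPolicy[MultipleLinearRandomRetry[6x10000ms, 10x60000ms]] means 6 retries with ~10 s sleeps followed by 10 retries with ~60 s sleeps (each individual sleep is randomized around its nominal value). A minimal sketch of the implied wait, counting nominal sleeps only and ignoring the per-attempt connect timeouts:

```python
# Sleep budget implied by MultipleLinearRandomRetry[6x10000ms, 10x60000ms]:
# each (retries, sleep_ms) pair means "retry that many times, sleeping
# roughly sleep_ms between attempts".
policy = [(6, 10_000), (10, 60_000)]

total_ms = sum(retries * sleep_ms for retries, sleep_ms in policy)
print(f"{total_ms / 60_000:.0f} minutes of sleeps")  # 11 minutes
```

Eleven minutes of sleeps, plus sixteen-odd connect timeouts, is consistent with the roughly fifteen minutes of retries visible in the master log below (22:48 to 23:03) before the client finally failed over.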
------------------ Original Message ------------------
From: "Bharath Vissapragada" <[email protected]>
Sent: Monday, December 8, 2014, 6:35 PM
To: "hbase-user" <[email protected]>
Subject: Re: After hadoop QJM failover, hbase can not write

Sorry if my previous comment was unclear. dfs.client.retry.policy.enabled
should be set to "false" (which is the default config). Overriding it to
"true" will make the ha-client pick a wrong retry policy; I just wanted to
make sure you didn't override it with a wrong setting.

Regarding your question on speeding up the failover, a quick look at the
codebase suggests the following configs might be relevant:

dfs.client.failover.max.attempts
dfs.client.failover.sleep.base.millis
dfs.client.failover.sleep.max.millis
dfs.client.retry.max.attempts

However, I suggest asking this question on the hdfs lists, as you might get
a more relevant answer there, since hbase is oblivious to hdfs failover.

On Mon, Dec 1, 2014 at 8:51 PM, 聪聪 <[email protected]> wrote:
> Thank you!
> Following your suggestion, I set "dfs.client.retry.policy.enabled" to
> "true" in core-site.xml and restarted for the change to take effect. I can
> see some changes in the hbase master log: retry information now appears.
> But it still takes a long time before hbase can write. I want to ask: how
> long until hbase can write again? What is the retry policy? Which
> parameters can be configured?
>
> Attached hbase master log:
> 2014-12-01 22:47:30,487 INFO [master:l-hbase2:60000-SendThread(
> l-hbase2.dba.dev.cn0.qunar.com:2181)] zookeeper.ClientCnxn: Session
> establishment complete on server
> l-hbase2.dba.dev.cn0.qunar.com/10.86.36.218:2181, sessionid =
> 0x14a0640d2100007, negotiated timeout = 40000
> 2014-12-01 22:48:38,729 INFO [master:l-hbase2:60000.oldLogCleaner]
> ipc.Client: Retrying connect to server: l-hbase1.dba.dev.cn0/
> 10.86.36.217:8020.
Already tried 0 time(s); retry policy is > RetryPolicy[MultipleLinearRandomRetry[6x10000ms, 10x60000ms], > TryOnceThenFail] > 2014-12-01 22:49:07,748 INFO [master:l-hbase2:60000.oldLogCleaner] > ipc.Client: Retrying connect to server: l-hbase1.dba.dev.cn0/ > 10.86.36.217:8020. Already tried 1 time(s); retry policy is > RetryPolicy[MultipleLinearRandomRetry[6x10000ms, 10x60000ms], > TryOnceThenFail] > 2014-12-01 22:49:22,534 DEBUG > [l-hbase2.dba.dev.cn0.qunar.com,60000,1417445031747-BalancerChore] > balancer.BaseLoadBalancer: Not running balancer because only 1 active > regionserver(s) > 2014-12-01 22:49:34,080 INFO [master:l-hbase2:60000.oldLogCleaner] > ipc.Client: Retrying connect to server: l-hbase1.dba.dev.cn0/ > 10.86.36.217:8020. Already tried 2 time(s); retry policy is > RetryPolicy[MultipleLinearRandomRetry[6x10000ms, 10x60000ms], > TryOnceThenFail] > 2014-12-01 22:49:54,752 INFO [master:l-hbase2:60000.oldLogCleaner] > ipc.Client: Retrying connect to server: l-hbase1.dba.dev.cn0/ > 10.86.36.217:8020. Already tried 3 time(s); retry policy is > RetryPolicy[MultipleLinearRandomRetry[6x10000ms, 10x60000ms], > TryOnceThenFail] > 2014-12-01 22:50:19,014 INFO [master:l-hbase2:60000.oldLogCleaner] > ipc.Client: Retrying connect to server: l-hbase1.dba.dev.cn0/ > 10.86.36.217:8020. Already tried 4 time(s); retry policy is > RetryPolicy[MultipleLinearRandomRetry[6x10000ms, 10x60000ms], > TryOnceThenFail] > 2014-12-01 22:50:44,438 INFO [master:l-hbase2:60000.oldLogCleaner] > ipc.Client: Retrying connect to server: l-hbase1.dba.dev.cn0/ > 10.86.36.217:8020. Already tried 5 time(s); retry policy is > RetryPolicy[MultipleLinearRandomRetry[6x10000ms, 10x60000ms], > TryOnceThenFail] > 2014-12-01 22:51:05,546 INFO [master:l-hbase2:60000.oldLogCleaner] > ipc.Client: Retrying connect to server: l-hbase1.dba.dev.cn0/ > 10.86.36.217:8020. 
Already tried 6 time(s); retry policy is > RetryPolicy[MultipleLinearRandomRetry[6x10000ms, 10x60000ms], > TryOnceThenFail] > 2014-12-01 22:51:58,980 INFO [master:l-hbase2:60000.oldLogCleaner] > ipc.Client: Retrying connect to server: l-hbase1.dba.dev.cn0/ > 10.86.36.217:8020. Already tried 7 time(s); retry policy is > RetryPolicy[MultipleLinearRandomRetry[6x10000ms, 10x60000ms], > TryOnceThenFail] > 2014-12-01 22:53:33,330 INFO [master:l-hbase2:60000.oldLogCleaner] > ipc.Client: Retrying connect to server: l-hbase1.dba.dev.cn0/ > 10.86.36.217:8020. Already tried 8 time(s); retry policy is > RetryPolicy[MultipleLinearRandomRetry[6x10000ms, 10x60000ms], > TryOnceThenFail] > 2014-12-01 22:54:22,533 DEBUG > [l-hbase2.dba.dev.cn0.qunar.com,60000,1417445031747-BalancerChore] > balancer.BaseLoadBalancer: Not running balancer because only 1 active > regionserver(s) > 2014-12-01 22:54:30,953 INFO [master:l-hbase2:60000.oldLogCleaner] > ipc.Client: Retrying connect to server: l-hbase1.dba.dev.cn0/ > 10.86.36.217:8020. Already tried 9 time(s); retry policy is > RetryPolicy[MultipleLinearRandomRetry[6x10000ms, 10x60000ms], > TryOnceThenFail] > 2014-12-01 22:55:43,189 INFO [master:l-hbase2:60000.oldLogCleaner] > ipc.Client: Retrying connect to server: l-hbase1.dba.dev.cn0/ > 10.86.36.217:8020. Already tried 10 time(s); retry policy is > RetryPolicy[MultipleLinearRandomRetry[6x10000ms, 10x60000ms], > TryOnceThenFail] > 2014-12-01 22:56:49,457 INFO [master:l-hbase2:60000.oldLogCleaner] > ipc.Client: Retrying connect to server: l-hbase1.dba.dev.cn0/ > 10.86.36.217:8020. Already tried 11 time(s); retry policy is > RetryPolicy[MultipleLinearRandomRetry[6x10000ms, 10x60000ms], > TryOnceThenFail] > 2014-12-01 22:58:29,088 INFO [master:l-hbase2:60000.oldLogCleaner] > ipc.Client: Retrying connect to server: l-hbase1.dba.dev.cn0/ > 10.86.36.217:8020. 
Already tried 12 time(s); retry policy is > RetryPolicy[MultipleLinearRandomRetry[6x10000ms, 10x60000ms], > TryOnceThenFail] > 2014-12-01 22:59:22,532 DEBUG > [l-hbase2.dba.dev.cn0.qunar.com,60000,1417445031747-BalancerChore] > balancer.BaseLoadBalancer: Not running balancer because only 1 active > regionserver(s) > 2014-12-01 22:59:25,346 INFO [master:l-hbase2:60000.oldLogCleaner] > ipc.Client: Retrying connect to server: l-hbase1.dba.dev.cn0/ > 10.86.36.217:8020. Already tried 13 time(s); retry policy is > RetryPolicy[MultipleLinearRandomRetry[6x10000ms, 10x60000ms], > TryOnceThenFail] > 2014-12-01 23:00:55,023 INFO [master:l-hbase2:60000.oldLogCleaner] > ipc.Client: Retrying connect to server: l-hbase1.dba.dev.cn0/ > 10.86.36.217:8020. Already tried 14 time(s); retry policy is > RetryPolicy[MultipleLinearRandomRetry[6x10000ms, 10x60000ms], > TryOnceThenFail] > 2014-12-01 23:01:59,966 INFO [master:l-hbase2:60000.oldLogCleaner] > ipc.Client: Retrying connect to server: l-hbase1.dba.dev.cn0/ > 10.86.36.217:8020. Already tried 15 time(s); retry policy is > RetryPolicy[MultipleLinearRandomRetry[6x10000ms, 10x60000ms], > TryOnceThenFail] > 2014-12-01 23:02:46,067 INFO [master:l-hbase2:60000.oldLogCleaner] > ipc.Client: Retrying connect to server: l-hbase1.dba.dev.cn0/ > 10.86.36.217:8020. Already tried 16 time(s); retry policy is > RetryPolicy[MultipleLinearRandomRetry[6x10000ms, 10x60000ms], > TryOnceThenFail] > 2014-12-01 23:03:01,073 INFO [master:l-hbase2:60000.oldLogCleaner] > retry.RetryInvocationHandler: Exception while invoking getListing of class > ClientNamenodeProtocolTranslatorPB over l-hbase1.dba.dev.cn0/ > 10.86.36.217:8020. Trying to fail over immediately. 
> java.net.ConnectException: Call From > l-hbase2.dba.dev.cn0.qunar.com/10.86.36.218 to l-hbase1.dba.dev.cn0:8020 > failed on connection exception: java.net.ConnectException: Connection timed > out; For more details see: > http://wiki.apache.org/hadoop/ConnectionRefused > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:526) > at > org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783) > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730) > at org.apache.hadoop.ipc.Client.call(Client.java:1415) > at org.apache.hadoop.ipc.Client.call(Client.java:1364) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) > at com.sun.proxy.$Proxy17.getListing(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:546) > at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > at com.sun.proxy.$Proxy18.getListing(Unknown Source) > at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:294) > at com.sun.proxy.$Proxy20.getListing(Unknown Source) > at 
org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1906) > at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1889) > at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:654) > at > org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:104) > at > org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:716) > at > org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:712) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:712) > at > org.apache.hadoop.hbase.util.FSUtils.listStatus(FSUtils.java:1555) > at > org.apache.hadoop.hbase.util.FSUtils.listStatus(FSUtils.java:1575) > at > org.apache.hadoop.hbase.master.cleaner.CleanerChore.chore(CleanerChore.java:123) > at org.apache.hadoop.hbase.Chore.run(Chore.java:87) > at java.lang.Thread.run(Thread.java:744) > Caused by: java.net.ConnectException: Connection timed out > at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) > at > sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:735) > at > org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) > at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529) > at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493) > at > org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:606) > at > org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:700) > at > org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1463) > at org.apache.hadoop.ipc.Client.call(Client.java:1382) > ... 
28 more
> 2014-12-01 23:03:01,082 DEBUG [master:l-hbase2:60000.oldLogCleaner]
> master.ReplicationLogCleaner: Didn't find this log in ZK, deleting:
> l-hbase3.dba.dev.cn0.qunar.com%2C60020%2C1417428218321.1417442628891.meta
>
> ------------------ Original Message ------------------
> From: "Bharath Vissapragada" <[email protected]>
> Sent: Monday, December 1, 2014, 8:27 PM
> To: "hbase-user" <[email protected]>
> Subject: Re: After hadoop QJM failover, hbase can not write
>
> Did you override "dfs.client.retry.policy.enabled" to "true" in the
> regionserver configs?
>
> On Mon, Dec 1, 2014 at 9:13 AM, 聪聪 <[email protected]> wrote:
>
> > Hi there:
> > I have run into a problem that has been troubling me.
> >
> > I am using hadoop-2.3.0-cdh5.1.0. Namenode HA uses the Quorum Journal
> > Manager (QJM) feature, and the dfs.ha.fencing.methods option is one of
> > the following:
> >
> > <property>
> >   <name>dfs.ha.fencing.methods</name>
> >   <value>sshfence
> >     shell(q_hadoop_fence.sh $target_host $target_port)
> >   </value>
> > </property>
> >
> > or
> >
> > <property>
> >   <name>dfs.ha.fencing.methods</name>
> >   <value>sshfence
> >     shell(/bin/true)
> >   </value>
> > </property>
> >
> > I use iptables to simulate a crash of the active namenode machine. After
> > automatic failover completes, hdfs can write normally (for example,
> > ./bin/hdfs dfs -put a.txt /tmp), but hbase still cannot write.
> > After a very long time hbase can write again, but I could not measure
> > how long it took.
> > I want to ask:
> > 1. Why can hbase not write after hdfs completes failover?
> > 2. After hdfs completes failover, how long until hbase can write?
> > 3. Do any particular parameters influence this time?
> >
> > Looking forward to your responses!
> > Attached regionserver log; it was only after the following content
> > appeared that writes became possible:
> > 2014-12-01 11:35:16,965 INFO [MemStoreFlusher.6] regionserver.HRegion:
> > Finished memstore flush of ~7.9 K/8096, currentsize=0/0 for region
> > t,,1417403859247.645d0fbe63663fabfb73025d3eb99524. in 46ms,
> > sequenceid=48, compaction requested=false
> > 2014-12-01 11:35:17,755 WARN [RpcServer.reader=1,port=60020]
> > ipc.RpcServer: RpcServer.listener,port=60020: count of bytes read: 0
> > java.io.IOException: Connection reset by peer
> >     at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> >     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> >     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
> >     at sun.nio.ch.IOUtil.read(IOUtil.java:197)
> >     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
> >     at org.apache.hadoop.hbase.ipc.RpcServer.channelRead(RpcServer.java:2248)
> >     at org.apache.hadoop.hbase.ipc.RpcServer$Connection.readAndProcess(RpcServer.java:1427)
> >     at org.apache.hadoop.hbase.ipc.RpcServer$Listener.doRead(RpcServer.java:802)
> >     at org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.doRunLoop(RpcServer.java:593)
> >     at org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.run(RpcServer.java:568)
> >     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >     at java.lang.Thread.run(Thread.java:744)
> >
> > Part of the datanode log follows:
> >
> > 2014-11-28 16:51:56,420 INFO org.apache.hadoop.ipc.Client: Retrying
> > connect to server: l-hbase1.dba.dev.cn0/10.86.36.217:8020. Already tried
> > 8 time(s); retry policy is
> > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000
> > MILLISECONDS)
> > 2014-11-28 16:52:12,421 INFO org.apache.hadoop.ipc.Client: Retrying
> > connect to server: l-hbase1.dba.dev.cn0/10.86.36.217:8020.
Already tried > > 9 time(s); retry policy is > > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > > MILLISECONDS) > > 2014-11-28 16:52:27,422 WARN > > org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in > offerService > > java.net.ConnectException: Call From > > l-hbase3.dba.dev.cn0.qunar.com/10.86.36.219 to l-hbase1.dba.dev.cn0:8020 > > failed on connection exception: java.net.ConnectException: > > Connection timed out; For more details see: > > http://wiki.apache.org/hadoop/ConnectionRefused > > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > > Method) > > at > > > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) > > at > > > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > > at > java.lang.reflect.Constructor.newInstance(Constructor.java:526) > > at > > org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783) > > at > org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730) > > at org.apache.hadoop.ipc.Client.call(Client.java:1413) > > at org.apache.hadoop.ipc.Client.call(Client.java:1362) > > at > > > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) > > at com.sun.proxy.$Proxy9.sendHeartbeat(Unknown Source) > > at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source) > > at > > > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > > at java.lang.reflect.Method.invoke(Method.java:606) > > at > > > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186) > > at > > > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > > at com.sun.proxy.$Proxy9.sendHeartbeat(Unknown Source) > > at > > > org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.sendHeartbeat(DatanodeProtocolClientSideTranslatorPB.java:178) > > at > > > 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:566) > > at > > > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:664) > > at > > > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:834) > > at java.lang.Thread.run(Thread.java:744) > > Caused by: java.net.ConnectException: Connection timed out > > at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) > > at > > sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:735) > > at > > > org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) > > at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529) > > at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493) > > at > > org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:604) > > at > > org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:699) > > at > > org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367) > > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1461) > > at org.apache.hadoop.ipc.Client.call(Client.java:1380) > > ... 14 more > > 2014-11-28 16:52:43,424 INFO org.apache.hadoop.ipc.Client: Retrying > > connect to server: l-hbase1.dba.dev.cn0/10.86.36.217:8020. Already tried > > 0 time(s); retry policy is > > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > > MILLISECONDS) > > 2014-11-28 16:52:59,424 INFO org.apache.hadoop.ipc.Client: Retrying > > connect to server: l-hbase1.dba.dev.cn0/10.86.36.217:8020. Already tried > > 1 time(s); retry policy is > > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > > MILLISECONDS) > > 2014-11-28 16:53:15,425 INFO org.apache.hadoop.ipc.Client: Retrying > > connect to server: l-hbase1.dba.dev.cn0/10.86.36.217:8020. 
Already tried
> > 2 time(s); retry policy is
> > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000
> > MILLISECONDS)
>
> --
> Bharath Vissapragada
> <http://www.cloudera.com>

--
Bharath Vissapragada
<http://www.cloudera.com>
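For anyone tuning this later: the failover-related settings Bharath lists above could be collected in hdfs-site.xml roughly as follows. The values shown are illustrative placeholders only, not tested recommendations, and their exact semantics and defaults should be checked against the hdfs documentation for your Hadoop version:

```xml
<!-- Illustrative values only; defaults and semantics vary by Hadoop version. -->
<property>
  <name>dfs.client.failover.max.attempts</name>
  <value>5</value>
</property>
<property>
  <name>dfs.client.failover.sleep.base.millis</name>
  <value>500</value>
</property>
<property>
  <name>dfs.client.failover.sleep.max.millis</name>
  <value>5000</value>
</property>
<property>
  <name>dfs.client.retry.max.attempts</name>
  <value>3</value>
</property>
```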
