[ https://issues.apache.org/jira/browse/HDFS-6179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tsz Wo Nicholas Sze resolved HDFS-6179.
---------------------------------------
    Resolution: Duplicate

Let's resolve this as duplicate.

> Synchronized BPOfferService - datanode locks for slow namenode reply.
> ---------------------------------------------------------------------
>
>                 Key: HDFS-6179
>                 URL: https://issues.apache.org/jira/browse/HDFS-6179
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, namenode
>    Affects Versions: 2.2.0
>            Reporter: Rafal Wojdyla
>
> Scenario:
> * 600 active DNs
> * 1 *active* NN
> * HA configuration
>
> When we start the SbNN, because of the huge number of blocks and a
> relatively small initialDelay, the SbNN goes through multiple
> stop-the-world garbage collections during startup (lasting minutes -
> the Namenode heap size is 75GB).
> We've observed that SbNN slowness affects the active NN: the active NN
> loses DNs (they are considered dead due to lack of heartbeats). We
> assume that some DNs are hanging.
> When a DN is considered dead by the active Namenode, we've observed a
> "dead lock" in the DN process; part of the stack trace:
> {noformat}
> "DataNode: [file:/disk1,file:/disk2] heartbeating to
> standbynamenode.net/10.10.10.10:8020" daemon prio=10 tid=0x00007ff429417800
> nid=0x7f2a in Object.wait() [0x00007ff42122c000]
>    java.lang.Thread.State: WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         at java.lang.Object.wait(Object.java:485)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1333)
>         - locked <0x00000007db94e4c8> (a org.apache.hadoop.ipc.Client$Call)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1300)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
>         at $Proxy9.registerDatanode(Unknown Source)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>         at $Proxy9.registerDatanode(Unknown Source)
>         at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:146)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:623)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reRegister(BPServiceActor.java:740)
>         at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromStandby(BPOfferService.java:603)
>         at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:506)
>         - locked <0x0000000780006e08> (a org.apache.hadoop.hdfs.server.datanode.BPOfferService)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:704)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:539)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:676)
>         at java.lang.Thread.run(Thread.java:662)
>
> "DataNode: [file:/disk1,file:/disk2] heartbeating to
> activenamenode.net/10.10.10.11:8020" daemon prio=10 tid=0x00007ff428a24000
> nid=0x7f29 waiting for monitor entry [0x00007ff42132e000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at org.apache.hadoop.hdfs.server.datanode.BPOfferService.updateActorStatesFromHeartbeat(BPOfferService.java:413)
>         - waiting to lock <0x0000000780006e08> (a org.apache.hadoop.hdfs.server.datanode.BPOfferService)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:535)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:676)
>         at java.lang.Thread.run(Thread.java:662)
> {noformat}
> Notice that it's the same lock, due to synchronization on BPOfferService.
> The problem is that the command from the standby can't be processed
> because the standby Namenode is unresponsive; nevertheless the DN waits
> for a reply from the SbNN, and it waits long enough to be considered
> dead by the active Namenode.
> Info: if we kill the SbNN, the DN instantly reconnects to the active NN.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
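The two stack traces reduce to a single-monitor contention pattern: one BPServiceActor thread enters a synchronized method on the shared BPOfferService object and then blocks on an RPC to the unresponsive standby NN, so the heartbeat thread for the active NN blocks trying to take the same monitor. A minimal sketch of that pattern follows; it is not Hadoop code, and the method bodies (latches standing in for the hung RPC) are hypothetical, though the method names mirror the trace.

```java
// Hypothetical sketch of the contention shown in the stack traces above.
// One thread holds the shared monitor while waiting on a "standby NN RPC";
// the "active NN heartbeat" thread blocks on the same monitor meanwhile.
import java.util.concurrent.CountDownLatch;

public class BPOfferServiceSketch {
    private final CountDownLatch rpcInFlight = new CountDownLatch(1);
    private final CountDownLatch standbyReplied = new CountDownLatch(1);

    // Stand-in for processCommandFromActor(): synchronized, and the
    // re-registration inside it blocks on an RPC to the standby NN.
    private synchronized void processCommandFromStandby() throws InterruptedException {
        rpcInFlight.countDown();
        standbyReplied.await();  // models Client.call() hanging on an unresponsive SbNN
    }

    // Stand-in for updateActorStatesFromHeartbeat(): also synchronized,
    // so the active-NN heartbeat cannot proceed while the RPC hangs.
    private synchronized void updateActorStatesFromHeartbeat() { }

    /** Runs both threads; returns the heartbeat thread's state while the RPC hangs. */
    static Thread.State demonstrate() throws InterruptedException {
        BPOfferServiceSketch bpos = new BPOfferServiceSketch();
        Thread standbyActor = new Thread(() -> {
            try { bpos.processCommandFromStandby(); } catch (InterruptedException ignored) { }
        });
        Thread heartbeatActor = new Thread(bpos::updateActorStatesFromHeartbeat);
        standbyActor.start();
        bpos.rpcInFlight.await();          // standby actor now owns the monitor
        heartbeatActor.start();
        Thread.State observed;
        do {                               // wait until the heartbeat thread hits the monitor
            observed = heartbeatActor.getState();
        } while (observed != Thread.State.BLOCKED);
        bpos.standbyReplied.countDown();   // "kill SbNN": the hung call returns, lock frees
        standbyActor.join();
        heartbeatActor.join();
        return observed;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("heartbeat thread while SbNN RPC hangs: " + demonstrate());
    }
}
```

Releasing the second latch corresponds to the reporter's observation that killing the SbNN lets the DN instantly reconnect to the active NN: once the hung call returns, the monitor is released and the heartbeat proceeds.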