Soumitra Sulav created HDDS-660:
-----------------------------------

             Summary: StatusRuntimeException : DataNode going dead
                 Key: HDDS-660
                 URL: https://issues.apache.org/jira/browse/HDDS-660
             Project: Hadoop Distributed Data Store
          Issue Type: Bug
          Components: Ozone Filesystem
    Affects Versions: 0.3.0
            Reporter: Soumitra Sulav


Issue 1 : hdfs operations fail with *INTERNAL_ERROR* when one of the datanodes is down, the reason being that the write cannot be replicated to the minimum number of datanodes. _The ERROR log could be more specific._

Issue 2 : The Datanode process is running but is in a dead state as per SCM. There are also exceptions in the DataNode logs: *StatusRuntimeException: INTERNAL: group-4D3A6FFFBFE2 not found.*

Is there a way to fix filesystem corruption, or an fsck utility like the one in hdfs?

+Steps followed to encounter the above issue :+

I had a clean setup of an ozone cluster and tried starting HDP services on o3 as defaultFS. Startup of YARN failed, and on checking the logs and UI, I see that the state of one of the datanodes is going to DEAD.

The hdfs cli commands on ozone fs give the below exception :
{code:java}
[root@hcatest-1 ~]# ozone fs -put ozone-site.xml /
2018-10-15 09:33:20,385 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-10-15 09:33:21,774 ERROR io.ChunkGroupOutputStream: Try to allocate more blocks for write failed, already allocated 0 blocks for this write.
put: Allocate block failed, error:INTERNAL_ERROR
{code}

Error logs on SCM :
{code:java}
2018-10-15 10:16:54,303 WARN org.apache.hadoop.hdds.scm.block.BlockManagerImpl: Unable to allocate container: {}
org.apache.hadoop.hdds.scm.exceptions.SCMException
    at org.apache.hadoop.hdds.scm.pipelines.PipelineSelector.getReplicationPipeline(PipelineSelector.java:268)
    at org.apache.hadoop.hdds.scm.container.ContainerStateManager.allocateContainer(ContainerStateManager.java:270)
    at org.apache.hadoop.hdds.scm.container.SCMContainerManager.allocateContainer(SCMContainerManager.java:312)
    at org.apache.hadoop.hdds.scm.block.BlockManagerImpl.preAllocateContainers(BlockManagerImpl.java:165)
    at org.apache.hadoop.hdds.scm.block.BlockManagerImpl.allocateBlock(BlockManagerImpl.java:279)
    at org.apache.hadoop.hdds.scm.server.SCMBlockProtocolServer.allocateBlock(SCMBlockProtocolServer.java:143)
    at org.apache.hadoop.ozone.protocolPB.ScmBlockLocationProtocolServerSideTranslatorPB.allocateScmBlock(ScmBlockLocationProtocolServerSideTranslatorPB.java:74)
    at org.apache.hadoop.hdds.protocol.proto.ScmBlockLocationProtocolProtos$ScmBlockLocationProtocolService$2.callBlockingMethod(ScmBlockLocationProtocolProtos.java:6255)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
2018-10-15 10:16:54,303 ERROR org.apache.hadoop.hdds.scm.block.BlockManagerImpl: Unable to allocate a block for the size: 268435456, type: RATIS, factor: THREE
{code}

DataNode error logs :
{code:java}
2018-10-15 10:33:13,522 INFO org.apache.ratis.server.impl.LeaderElection: 0e4e7c9b-84a9-48a3-b44d-d906231e77b2 got exception when requesting votes: {}
java.util.concurrent.ExecutionException: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: INTERNAL: 3cf6e2da-4fdb-4198-a24d-5c34ca02fe4d: group-4D3A6FFFBFE2 not found.
    at java.util.concurrent.FutureTask.report(FutureTask.java:122)
    at java.util.concurrent.FutureTask.get(FutureTask.java:192)
    at org.apache.ratis.server.impl.LeaderElection.waitForResults(LeaderElection.java:214)
    at org.apache.ratis.server.impl.LeaderElection.askForVotes(LeaderElection.java:146)
    at org.apache.ratis.server.impl.LeaderElection.run(LeaderElection.java:102)
Caused by: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: INTERNAL: 3cf6e2da-4fdb-4198-a24d-5c34ca02fe4d: group-4D3A6FFFBFE2 not found.
    at org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:222)
    at org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:203)
    at org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:132)
    at org.apache.ratis.proto.grpc.RaftServerProtocolServiceGrpc$RaftServerProtocolServiceBlockingStub.requestVote(RaftServerProtocolServiceGrpc.java:265)
    at org.apache.ratis.grpc.server.GrpcServerProtocolClient.requestVote(GrpcServerProtocolClient.java:61)
    at org.apache.ratis.grpc.server.GrpcService.requestVote(GrpcService.java:150)
    at org.apache.ratis.server.impl.LeaderElection.lambda$submitRequests$0(LeaderElection.java:188)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2018-10-15 10:33:13,523 INFO org.apache.ratis.server.impl.LeaderElection: 0e4e7c9b-84a9-48a3-b44d-d906231e77b2: Election REJECTED; received 0 response(s) [] and 2 exception(s); 0e4e7c9b-84a9-48a3-b44d-d906231e77b2:t140, leader=null, voted=0e4e7c9b-84a9-48a3-b44d-d906231e77b2, raftlog=[(t:1, i:1)], conf=0: [76b2ad5f-1a40-4a28-9fc1-b91437fe1398:172.22.119.190:9858, 0e4e7c9b-84a9-48a3-b44d-d906231e77b2:172.22.119.189:9858, 3cf6e2da-4fdb-4198-a24d-5c34ca02fe4d:172.22.119.19:9858], old=null
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org