Mukul Kumar Singh created HDFS-13327: ----------------------------------------
Summary: Ratis should handle failure conditions where append request to followers can fail Key: HDFS-13327 URL: https://issues.apache.org/jira/browse/HDFS-13327 Project: Hadoop HDFS Issue Type: Bug Reporter: Mukul Kumar Singh Assignee: Mukul Kumar Singh This was observed on a 3 node ratis config with ozone, this was simulated using TestFreon unit test. The problem is append to {{127.0.0.1_63765}} failed because the node ran out of memory. A new leader is elected which tries to send the append requests to the same follower, but the request is out of order. {code} 2018-03-21 22:05:01,495 INFO ratis.ContainerStateMachine (ContainerStateMachine.java:handleWriteChunk(207)) - adding index 185 2018-03-21 22:05:02,054 WARN impl.LogAppender (GRpcLogAppender.java:onError(226)) - 127.0.0.1_63790 got error when appending entries to 127.0.0.1_63765, exception: {}. java.io.IOException: org.apache.ratis.shaded.io.grpc.StatusRuntimeException: INTERNAL: Java heap space at org.apache.ratis.grpc.RaftGrpcUtil.unwrapException(RaftGrpcUtil.java:65) at org.apache.ratis.grpc.RaftGrpcUtil.unwrapThrowable(RaftGrpcUtil.java:48) at org.apache.ratis.grpc.server.GRpcLogAppender$AppendLogResponseHandler.onError(GRpcLogAppender.java:227) at org.apache.ratis.shaded.io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:395) at org.apache.ratis.shaded.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:481) at org.apache.ratis.shaded.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$600(ClientCallImpl.java:398) at org.apache.ratis.shaded.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:513) at org.apache.ratis.shaded.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:52) at org.apache.ratis.shaded.io.grpc.internal.SerializingExecutor$TaskRunner.run(SerializingExecutor.java:154) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.ratis.shaded.io.grpc.StatusRuntimeException: INTERNAL: Java heap space at org.apache.ratis.shaded.io.grpc.Status.asRuntimeException(Status.java:545) ... 9 more 2018-03-21 22:05:02,088 INFO ratis.ContainerStateMachine (ContainerStateMachine.java:applyTransaction(263)) - removing index 185 ^M 90.00% |███████████████████████████████████████████████████████████████████████████████████████████ | 18/20 Time: 0:01:282018-03-21 22:05:02,622 INFO impl.FollowerState (FollowerState.java:r un(72)) - 127.0.0.1_63776 changes to CANDIDATE, lastRpcTime:534, electionTimeout:414ms 2018-03-21 22:05:02,622 INFO impl.RaftServerImpl (RaftServerImpl.java:setRole(159)) - 127.0.0.1_63776 changes role from FOLLOWER to CANDIDATE at term 31 for changeToCandidate 2018-03-21 22:05:02,622 INFO impl.FollowerState (FollowerState.java:run(72)) - 127.0.0.1_63765 changes to CANDIDATE, lastRpcTime:576, electionTimeout:411ms 2018-03-21 22:05:02,623 INFO impl.RaftServerImpl (RaftServerImpl.java:setRole(159)) - 127.0.0.1_63765 changes role from FOLLOWER to CANDIDATE at term 31 for changeToCandidate 2018-03-21 22:05:02,623 INFO impl.RaftServerImpl (ServerState.java:setLeader(215)) - 127.0.0.1_63776: change Leader from 127.0.0.1_63790 to null at term 31 for initElection 2018-03-21 22:05:02,628 INFO impl.RaftServerImpl (ServerState.java:setLeader(215)) - 127.0.0.1_63765: change Leader from 127.0.0.1_63790 to null at term 31 for initElection 2018-03-21 22:05:02,634 INFO impl.LeaderElection (LeaderElection.java:askForVotes(127)) - 127.0.0.1_63765: begin an election in Term 32 2018-03-21 22:05:02,634 INFO impl.LeaderElection (LeaderElection.java:askForVotes(127)) - 127.0.0.1_63776: begin an election in Term 32 2018-03-21 22:05:02,635 INFO impl.RaftServerImpl (ServerState.java:setLeader(215)) - 127.0.0.1_63790: change Leader from 127.0.0.1_63790 to null at term 32 for updateCurrentTerm 2018-03-21 22:05:02,635 INFO impl.RaftServerImpl (ServerState.java:setLeader(215)) - 127.0.0.1_63790: change Leader from 127.0.0.1_63790 to null at term 32 for updateCurrentTerm 2018-03-21 22:05:02,636 INFO impl.RaftServerImpl (RaftServerImpl.java:setRole(159)) - 127.0.0.1_63790 changes role from LEADER to FOLLOWER at term 32 for changeToFollower 2018-03-21 22:05:02,636 INFO impl.PendingRequests (PendingRequests.java:sendNotLeaderResponses(147)) - 127.0.0.1_63790 sends responses before shutting down PendingRequestsHandler 2018-03-21 22:05:02,637 INFO server.RaftServerProtocolService (RaftServerProtocolService.java:onCompleted(104)) - 127.0.0.1_63776: appendEntries completed 2018-03-21 22:05:02,640 INFO impl.LogAppender (GRpcLogAppender.java:onCompleted(254)) - 127.0.0.1_63790 stops appending log entries to follower 127.0.0.1_63776(next=186, match=185, attendVote=true, lastR pcSendTime=86068ms, lastRpcResponseTime=86080ms) 2018-03-21 22:05:02,646 INFO impl.LeaderElection (LeaderElection.java:logAndReturn(46)) - 127.0.0.1_63776: Election PASSED; received 2 response(s) [127.0.0.1_63776<-127.0.0.1_63765#0:FAIL-t32, 127.0.0.1_ 63776<-127.0.0.1_63790#0:OK-t32] and 0 exception(s); 127.0.0.1_63776:t32, leader=null, voted=127.0.0.1_63776, raftlog=[(t:31, i:185)], conf=[127.0.0.1_63765:127.0.0.1:63765, 127.0.0.1_63776:127.0.0.1:6377 6, 127.0.0.1_63790:127.0.0.1:63790], old=null 2018-03-21 22:05:02,647 INFO impl.RaftServerImpl (RaftServerImpl.java:setRole(159)) - 127.0.0.1_63776 changes role from CANDIDATE to LEADER at term 32 for changeToLeader 2018-03-21 22:05:02,647 INFO impl.RaftServerImpl (ServerState.java:setLeader(215)) - 127.0.0.1_63776: change Leader from null to 127.0.0.1_63776 at term 32 for becomeLeader 2018-03-21 22:05:02,647 INFO conf.ConfUtils (ConfUtils.java:logGet(41)) - raft.server.staging.catchup.gap = 1000 (default) 2018-03-21 22:05:02,647 INFO impl.LeaderElection (LeaderElection.java:logAndReturn(46)) - 127.0.0.1_63765: Election REJECTED; received 2 response(s) [127.0.0.1_63765<-127.0.0.1_63776#0:FAIL-t32, 127.0.0.1_63765<-127.0.0.1_63790#0:FAIL-t32] and 0 exception(s); 127.0.0.1_63765:t32, leader=null, voted=127.0.0.1_63765, raftlog=[(t:31, i:185)], conf=[127.0.0.1_63765:127.0.0.1:63765, 127.0.0.1_63776:127.0.0.1:63776, 127.0.0.1_63790:127.0.0.1:63790], old=null 2018-03-21 22:05:02,647 INFO conf.ConfUtils (ConfUtils.java:logGet(41)) - raft.server.rpc.sleep.time = 25 ms (default) 2018-03-21 22:05:02,650 INFO impl.RaftServerImpl (RaftServerImpl.java:setRole(159)) - 127.0.0.1_63765 changes role from CANDIDATE to FOLLOWER at term 32 for changeToFollower 2018-03-21 22:05:02,650 INFO conf.ConfUtils (ConfUtils.java:logGet(41)) - raft.server.log.appender.buffer.capacity = 134217728 (custom) 2018-03-21 22:05:02,650 INFO conf.ConfUtils (ConfUtils.java:logGet(41)) - raft.server.log.appender.batch.enabled = true (custom) 2018-03-21 22:05:02,652 INFO conf.ConfUtils (ConfUtils.java:logGet(41)) - raft.server.log.appender.snapshot.chunk.size.max = 16MB (=16777216) (default) 2018-03-21 22:05:02,653 INFO conf.ConfUtils (ConfUtils.java:logGet(41)) - raft.grpc.server.leader.outstanding.appends.max = 128 (default) 2018-03-21 22:05:02,653 INFO conf.ConfUtils (ConfUtils.java:logGet(41)) - raft.server.log.appender.buffer.capacity = 134217728 (custom) 2018-03-21 22:05:02,653 INFO conf.ConfUtils (ConfUtils.java:logGet(41)) - raft.server.log.appender.batch.enabled = true (custom) 2018-03-21 22:05:02,653 INFO conf.ConfUtils (ConfUtils.java:logGet(41)) - raft.server.log.appender.snapshot.chunk.size.max = 16MB (=16777216) (default) 2018-03-21 22:05:02,653 INFO conf.ConfUtils (ConfUtils.java:logGet(41)) - raft.grpc.server.leader.outstanding.appends.max = 128 (default) 2018-03-21 22:05:02,654 INFO storage.RaftLogWorker (RaftLogWorker.java:rollLogSegment(256)) - Rolling segment:127.0.0.1_63776-RaftLogWorker index to:186 2018-03-21 22:05:02,673 INFO impl.RaftServerImpl (ServerState.java:setLeader(215)) - 127.0.0.1_63765: change Leader from null to 127.0.0.1_63776 at term 32 for appendEntries 2018-03-21 22:05:02,673 INFO impl.RaftServerImpl (ServerState.java:setLeader(215)) - 127.0.0.1_63790: change Leader from null to 127.0.0.1_63776 at term 32 for appendEntries 2018-03-21 22:05:02,674 INFO storage.RaftLogWorker (RaftLogWorker.java:rollLogSegment(256)) - Rolling segment:127.0.0.1_63790-RaftLogWorker index to:186 2018-03-21 22:05:02,674 INFO storage.RaftLogWorker (RaftLogWorker.java:rollLogSegment(256)) - Rolling segment:127.0.0.1_63765-RaftLogWorker index to:186 2018-03-21 22:05:02,712 INFO server.RaftServerProtocolService (RaftServerProtocolService.java:onCompleted(104)) - 127.0.0.1_63765: appendEntries completed 2018-03-21 22:05:02,713 INFO impl.LogAppender (GRpcLogAppender.java:onCompleted(254)) - 127.0.0.1_63790 stops appending log entries to follower 127.0.0.1_63765(next=186, match=184, attendVote=true, lastRpcSendTime=86081ms, lastRpcResponseTime=86157ms) Mar 21, 2018 10:05:02 PM org.apache.ratis.shaded.io.grpc.internal.ManagedChannelImpl maybeTerminateChannel INFO: [ManagedChannelImpl@643d117e] Terminated 2018-03-21 22:05:03,016 INFO conf.ConfUtils (ConfUtils.java:logGet(41)) - raft.grpc.flow.control.window = 1MB (=1048576) (default) 2018-03-21 22:05:03,017 INFO conf.ConfUtils (ConfUtils.java:logGet(41)) - raft.grpc.message.size.max = 33554432 (custom) Mar 21, 2018 10:05:03 PM org.apache.ratis.shaded.io.grpc.internal.ManagedChannelImpl <init> INFO: [ManagedChannelImpl@4a23944d] Created with target 127.0.0.1:63776 2018-03-21 22:05:03,479 INFO ratis.ContainerStateMachine (ContainerStateMachine.java:applyTransaction(263)) - removing index 185 2018-03-21 22:05:03,562 ERROR storage.RaftLogWorker (ExitUtils.java:terminate(86)) - Terminating with exit status 1: 127.0.0.1_63765-RaftLogWorker failed. java.lang.IllegalStateException: lastWrittenIndex == 184, entry == term: 32 index: 186 noOp { } {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org