Mukul Kumar Singh created HDFS-13327:
----------------------------------------

             Summary: Ratis should handle failure conditions where append 
request to followers can fail
                 Key: HDFS-13327
                 URL: https://issues.apache.org/jira/browse/HDFS-13327
             Project: Hadoop HDFS
          Issue Type: Bug
            Reporter: Mukul Kumar Singh
            Assignee: Mukul Kumar Singh


This was observed on a 3 node ratis config with ozone, this was simulated using 
TestFreon unit test.

The problem is append to {{127.0.0.1_63765}} failed because the node ran out of 
memory.
A new leader is elected which tries to send the append requests to the same 
follower, but the request is out of order.

{code}
2018-03-21 22:05:01,495 INFO  ratis.ContainerStateMachine 
(ContainerStateMachine.java:handleWriteChunk(207)) - adding  index 185
2018-03-21 22:05:02,054 WARN  impl.LogAppender 
(GRpcLogAppender.java:onError(226)) - 127.0.0.1_63790 got error when appending 
entries to 127.0.0.1_63765, exception: {}.
java.io.IOException: org.apache.ratis.shaded.io.grpc.StatusRuntimeException: 
INTERNAL: Java heap space
        at 
org.apache.ratis.grpc.RaftGrpcUtil.unwrapException(RaftGrpcUtil.java:65)
        at 
org.apache.ratis.grpc.RaftGrpcUtil.unwrapThrowable(RaftGrpcUtil.java:48)
        at 
org.apache.ratis.grpc.server.GRpcLogAppender$AppendLogResponseHandler.onError(GRpcLogAppender.java:227)
        at 
org.apache.ratis.shaded.io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:395)
        at 
org.apache.ratis.shaded.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:481)
        at 
org.apache.ratis.shaded.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$600(ClientCallImpl.java:398)
        at 
org.apache.ratis.shaded.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:513)
        at 
org.apache.ratis.shaded.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:52)
        at 
org.apache.ratis.shaded.io.grpc.internal.SerializingExecutor$TaskRunner.run(SerializingExecutor.java:154)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.ratis.shaded.io.grpc.StatusRuntimeException: INTERNAL: 
Java heap space
        at 
org.apache.ratis.shaded.io.grpc.Status.asRuntimeException(Status.java:545)
        ... 9 more
2018-03-21 22:05:02,088 INFO  ratis.ContainerStateMachine 
(ContainerStateMachine.java:applyTransaction(263)) - removing index 185
^M 90.00% 
|███████████████████████████████████████████████████████████████████████████████████████████
          |  18/20 Time: 0:01:282018-03-21 22:05:02,622 INFO  
impl.FollowerState (FollowerState.java:r
un(72)) - 127.0.0.1_63776 changes to CANDIDATE, lastRpcTime:534, 
electionTimeout:414ms
2018-03-21 22:05:02,622 INFO  impl.RaftServerImpl 
(RaftServerImpl.java:setRole(159)) - 127.0.0.1_63776 changes role from FOLLOWER 
to CANDIDATE at term 31 for changeToCandidate
2018-03-21 22:05:02,622 INFO  impl.FollowerState (FollowerState.java:run(72)) - 
127.0.0.1_63765 changes to CANDIDATE, lastRpcTime:576, electionTimeout:411ms
2018-03-21 22:05:02,623 INFO  impl.RaftServerImpl 
(RaftServerImpl.java:setRole(159)) - 127.0.0.1_63765 changes role from FOLLOWER 
to CANDIDATE at term 31 for changeToCandidate
2018-03-21 22:05:02,623 INFO  impl.RaftServerImpl 
(ServerState.java:setLeader(215)) - 127.0.0.1_63776: change Leader from 
127.0.0.1_63790 to null at term 31 for initElection
2018-03-21 22:05:02,628 INFO  impl.RaftServerImpl 
(ServerState.java:setLeader(215)) - 127.0.0.1_63765: change Leader from 
127.0.0.1_63790 to null at term 31 for initElection
2018-03-21 22:05:02,634 INFO  impl.LeaderElection 
(LeaderElection.java:askForVotes(127)) - 127.0.0.1_63765: begin an election in 
Term 32
2018-03-21 22:05:02,634 INFO  impl.LeaderElection 
(LeaderElection.java:askForVotes(127)) - 127.0.0.1_63776: begin an election in 
Term 32
2018-03-21 22:05:02,635 INFO  impl.RaftServerImpl 
(ServerState.java:setLeader(215)) - 127.0.0.1_63790: change Leader from 
127.0.0.1_63790 to null at term 32 for updateCurrentTerm
2018-03-21 22:05:02,635 INFO  impl.RaftServerImpl 
(ServerState.java:setLeader(215)) - 127.0.0.1_63790: change Leader from 
127.0.0.1_63790 to null at term 32 for updateCurrentTerm
2018-03-21 22:05:02,636 INFO  impl.RaftServerImpl 
(RaftServerImpl.java:setRole(159)) - 127.0.0.1_63790 changes role from LEADER 
to FOLLOWER at term 32 for changeToFollower
2018-03-21 22:05:02,636 INFO  impl.PendingRequests 
(PendingRequests.java:sendNotLeaderResponses(147)) - 127.0.0.1_63790 sends 
responses before shutting down PendingRequestsHandler
2018-03-21 22:05:02,637 INFO  server.RaftServerProtocolService 
(RaftServerProtocolService.java:onCompleted(104)) - 127.0.0.1_63776: 
appendEntries completed
2018-03-21 22:05:02,640 INFO  impl.LogAppender 
(GRpcLogAppender.java:onCompleted(254)) - 127.0.0.1_63790 stops appending log 
entries to follower 127.0.0.1_63776(next=186, match=185, attendVote=true, lastR
pcSendTime=86068ms, lastRpcResponseTime=86080ms)
2018-03-21 22:05:02,646 INFO  impl.LeaderElection 
(LeaderElection.java:logAndReturn(46)) - 127.0.0.1_63776: Election PASSED; 
received 2 response(s) [127.0.0.1_63776<-127.0.0.1_63765#0:FAIL-t32, 127.0.0.1_
63776<-127.0.0.1_63790#0:OK-t32] and 0 exception(s); 127.0.0.1_63776:t32, 
leader=null, voted=127.0.0.1_63776, raftlog=[(t:31, i:185)], 
conf=[127.0.0.1_63765:127.0.0.1:63765, 127.0.0.1_63776:127.0.0.1:6377
6, 127.0.0.1_63790:127.0.0.1:63790], old=null
2018-03-21 22:05:02,647 INFO  impl.RaftServerImpl 
(RaftServerImpl.java:setRole(159)) - 127.0.0.1_63776 changes role from 
CANDIDATE to LEADER at term 32 for changeToLeader
2018-03-21 22:05:02,647 INFO  impl.RaftServerImpl 
(ServerState.java:setLeader(215)) - 127.0.0.1_63776: change Leader from null to 
127.0.0.1_63776 at term 32 for becomeLeader
2018-03-21 22:05:02,647 INFO  conf.ConfUtils (ConfUtils.java:logGet(41)) - 
raft.server.staging.catchup.gap = 1000 (default)
2018-03-21 22:05:02,647 INFO  impl.LeaderElection 
(LeaderElection.java:logAndReturn(46)) - 127.0.0.1_63765: Election REJECTED; 
received 2 response(s) [127.0.0.1_63765<-127.0.0.1_63776#0:FAIL-t32, 
127.0.0.1_63765<-127.0.0.1_63790#0:FAIL-t32] and 0 exception(s); 
127.0.0.1_63765:t32, leader=null, voted=127.0.0.1_63765, raftlog=[(t:31, 
i:185)], conf=[127.0.0.1_63765:127.0.0.1:63765, 
127.0.0.1_63776:127.0.0.1:63776, 127.0.0.1_63790:127.0.0.1:63790], old=null
2018-03-21 22:05:02,647 INFO  conf.ConfUtils (ConfUtils.java:logGet(41)) - 
raft.server.rpc.sleep.time = 25 ms (default)
2018-03-21 22:05:02,650 INFO  impl.RaftServerImpl 
(RaftServerImpl.java:setRole(159)) - 127.0.0.1_63765 changes role from 
CANDIDATE to FOLLOWER at term 32 for changeToFollower
2018-03-21 22:05:02,650 INFO  conf.ConfUtils (ConfUtils.java:logGet(41)) - 
raft.server.log.appender.buffer.capacity = 134217728 (custom)
2018-03-21 22:05:02,650 INFO  conf.ConfUtils (ConfUtils.java:logGet(41)) - 
raft.server.log.appender.batch.enabled = true (custom)
2018-03-21 22:05:02,652 INFO  conf.ConfUtils (ConfUtils.java:logGet(41)) - 
raft.server.log.appender.snapshot.chunk.size.max = 16MB (=16777216) (default)
2018-03-21 22:05:02,653 INFO  conf.ConfUtils (ConfUtils.java:logGet(41)) - 
raft.grpc.server.leader.outstanding.appends.max = 128 (default)
2018-03-21 22:05:02,653 INFO  conf.ConfUtils (ConfUtils.java:logGet(41)) - 
raft.server.log.appender.buffer.capacity = 134217728 (custom)
2018-03-21 22:05:02,653 INFO  conf.ConfUtils (ConfUtils.java:logGet(41)) - 
raft.server.log.appender.batch.enabled = true (custom)
2018-03-21 22:05:02,653 INFO  conf.ConfUtils (ConfUtils.java:logGet(41)) - 
raft.server.log.appender.snapshot.chunk.size.max = 16MB (=16777216) (default)
2018-03-21 22:05:02,653 INFO  conf.ConfUtils (ConfUtils.java:logGet(41)) - 
raft.grpc.server.leader.outstanding.appends.max = 128 (default)
2018-03-21 22:05:02,654 INFO  storage.RaftLogWorker 
(RaftLogWorker.java:rollLogSegment(256)) - Rolling 
segment:127.0.0.1_63776-RaftLogWorker index to:186
2018-03-21 22:05:02,673 INFO  impl.RaftServerImpl 
(ServerState.java:setLeader(215)) - 127.0.0.1_63765: change Leader from null to 
127.0.0.1_63776 at term 32 for appendEntries
2018-03-21 22:05:02,673 INFO  impl.RaftServerImpl 
(ServerState.java:setLeader(215)) - 127.0.0.1_63790: change Leader from null to 
127.0.0.1_63776 at term 32 for appendEntries
2018-03-21 22:05:02,674 INFO  storage.RaftLogWorker 
(RaftLogWorker.java:rollLogSegment(256)) - Rolling 
segment:127.0.0.1_63790-RaftLogWorker index to:186
2018-03-21 22:05:02,674 INFO  storage.RaftLogWorker 
(RaftLogWorker.java:rollLogSegment(256)) - Rolling 
segment:127.0.0.1_63765-RaftLogWorker index to:186
2018-03-21 22:05:02,712 INFO  server.RaftServerProtocolService 
(RaftServerProtocolService.java:onCompleted(104)) - 127.0.0.1_63765: 
appendEntries completed
2018-03-21 22:05:02,713 INFO  impl.LogAppender 
(GRpcLogAppender.java:onCompleted(254)) - 127.0.0.1_63790 stops appending log 
entries to follower 127.0.0.1_63765(next=186, match=184, attendVote=true, 
lastRpcSendTime=86081ms, lastRpcResponseTime=86157ms)
Mar 21, 2018 10:05:02 PM 
org.apache.ratis.shaded.io.grpc.internal.ManagedChannelImpl 
maybeTerminateChannel
INFO: [ManagedChannelImpl@643d117e] Terminated
2018-03-21 22:05:03,016 INFO  conf.ConfUtils (ConfUtils.java:logGet(41)) - 
raft.grpc.flow.control.window = 1MB (=1048576) (default)
2018-03-21 22:05:03,017 INFO  conf.ConfUtils (ConfUtils.java:logGet(41)) - 
raft.grpc.message.size.max = 33554432 (custom)
Mar 21, 2018 10:05:03 PM 
org.apache.ratis.shaded.io.grpc.internal.ManagedChannelImpl <init>
INFO: [ManagedChannelImpl@4a23944d] Created with target 127.0.0.1:63776
2018-03-21 22:05:03,479 INFO  ratis.ContainerStateMachine 
(ContainerStateMachine.java:applyTransaction(263)) - removing index 185
2018-03-21 22:05:03,562 ERROR storage.RaftLogWorker 
(ExitUtils.java:terminate(86)) - Terminating with exit status 1: 
127.0.0.1_63765-RaftLogWorker failed.
java.lang.IllegalStateException: lastWrittenIndex == 184, entry == term: 32
index: 186
noOp {
}
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org

Reply via email to