jsancio commented on code in PR #18240:
URL: https://github.com/apache/kafka/pull/18240#discussion_r1899633981

##########
raft/src/main/java/org/apache/kafka/raft/KafkaRaftClient.java:
##########
@@ -2935,14 +3014,18 @@ private long pollResigned(long currentTimeMs) {
             // until either the shutdown expires or an election bumps the epoch
             stateTimeoutMs = shutdown.remainingTimeMs();
         } else if (state.hasElectionTimeoutExpired(currentTimeMs)) {
-            if (quorum.isVoter()) {
-                transitionToCandidate(currentTimeMs);
-            } else {
+//            if (quorum.isVoter()) {
+            // canElectNewLeaderAfterOldLeaderPartitioned fails if we do not bump epoch since it is possible
+            // that the replica ends up as follower in the same epoch.
+            // resigned(leaderId=local) -> prospective(leaderId=local) -> follower(leaderId=local) which is illegal
+//                transitionToProspective(quorum.epoch() + 1, currentTimeMs);
+//                transitionToCandidate(currentTimeMs);
+//            } else {

Review Comment:
   > the existing raft event simulation tests picked up on a new bug in pollResigned

   What is the exact error? Let's add a unit test to one of the `KafkaRaftClient*Test` suites that shows the bug.

   > attempt to become follower of itself in epoch 5.

   Let's add a check to `transitionToFollower` that verifies that `leaderId` is not equal to `localId`.

   It makes sense to me that after the resigned state the replica should always increase its epoch. The replica resigned from leadership at epoch X, so eventually the epoch will be at least X + 1.

   Did you consider transitioning to candidate and relaxing the transition functions to allow both resigned and prospective to transition to candidate?
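
To illustrate the guard suggested in the review comment, here is a minimal, self-contained sketch. It is not Kafka's actual code: `QuorumSketch`, its fields, and the simplified `transitionToFollower(int, int)` signature are hypothetical stand-ins for the real quorum state class, which tracks considerably more. The sketch only shows the shape of the check (rejecting `leaderId == localId`) and an epoch bump on leaving the resigned state.

```java
import java.util.OptionalInt;

/** Hypothetical, simplified stand-in for a replica's quorum state; not Kafka's real class. */
final class QuorumSketch {
    private final int localId;
    private int epoch;
    private OptionalInt leaderId = OptionalInt.empty();

    QuorumSketch(int localId, int epoch) {
        this.localId = localId;
        this.epoch = epoch;
    }

    /** A resigning leader stays in its epoch and is still recorded as that epoch's leader. */
    void resign() {
        leaderId = OptionalInt.of(localId);
    }

    /** Leaving the resigned state always moves to a strictly larger epoch with no known leader. */
    void bumpEpochAfterResign() {
        epoch = epoch + 1;
        leaderId = OptionalInt.empty();
    }

    /** The guard from the review: a replica must never become a follower of itself. */
    void transitionToFollower(int newEpoch, int newLeaderId) {
        if (newLeaderId == localId) {
            throw new IllegalStateException(
                "Cannot become follower of itself (id " + localId + ") in epoch " + newEpoch);
        }
        if (newEpoch < epoch) {
            throw new IllegalStateException(
                "Cannot move from epoch " + epoch + " back to epoch " + newEpoch);
        }
        epoch = newEpoch;
        leaderId = OptionalInt.of(newLeaderId);
    }

    int epoch() {
        return epoch;
    }

    OptionalInt leaderId() {
        return leaderId;
    }
}
```

With this guard in place, the illegal sequence from the diff comment, resigned(leaderId=local) followed by follower(leaderId=local) in the same epoch, would fail fast with an `IllegalStateException` instead of leaving the replica following itself.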