jsancio commented on code in PR #18240: URL: https://github.com/apache/kafka/pull/18240#discussion_r1899579689
########## raft/src/main/java/org/apache/kafka/raft/QuorumState.java: ########## @@ -39,40 +39,47 @@ * how they are triggered: * * Resigned transitions to: - * Unattached: After learning of a new election with a higher epoch - * Candidate: After expiration of the election timeout - * Follower: After discovering a leader with an equal or larger epoch + * Unattached: After learning of a new election with a higher epoch + * Prospective: After expiration of the election timeout + * Follower: After discovering a leader with an equal or larger epoch * * Unattached transitions to: - * Unattached: After learning of a new election with a higher epoch or after giving a binding vote - * Candidate: After expiration of the election timeout - * Follower: After discovering a leader with an equal or larger epoch + * Unattached: After learning of a new election with a higher epoch or after giving a binding vote + * Prospective: After expiration of the election timeout + * Follower: After discovering a leader with an equal or larger epoch + * + * Prospective transitions to: + * Unattached: After learning of an election with a higher epoch, or node did not have last + * known leader and loses/times out election + * Candidate: After receiving a majority of PreVotes granted + * Follower: After discovering a leader with a larger epoch, or node had a last known leader + * and loses/times out election * * Candidate transitions to: - * Unattached: After learning of a new election with a higher epoch - * Candidate: After expiration of the election timeout - * Leader: After receiving a majority of votes + * Unattached: After learning of a new election with a higher epoch + * Prospective: After expiration of the election timeout or loss of election + * Leader: After receiving a majority of votes * * Leader transitions to: - * Unattached: After learning of a new election with a higher epoch - * Resigned: When shutting down gracefully + * Unattached: After learning of a new election with a higher epoch + * Resigned: When shutting down gracefully * * Follower transitions to: - * Unattached: After learning of a new election with a higher epoch - * Candidate: After expiration of the fetch timeout - * Follower: After discovering a leader with a larger epoch + * Unattached: After learning of a new election with a higher epoch + * Prospective: After expiration of the fetch timeout + * Follower: After discovering a leader with a larger epoch * - * Observers follow a simpler state machine. The Voted/Candidate/Leader/Resigned + * Observers follow a simpler state machine. The Prospective/Candidate/Leader/Resigned Review Comment: Yes. This is correct. Observers can vote for candidate (KIP-853) and prospective (KIP-996). This was changed as part of KIP-853 as documented [here](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=217391519#KIP853:KRaftControllerMembershipChanges-Leaderelection). ########## raft/src/main/java/org/apache/kafka/raft/KafkaRaftClient.java: ########## @@ -2935,14 +3014,18 @@ private long pollResigned(long currentTimeMs) { // until either the shutdown expires or an election bumps the epoch stateTimeoutMs = shutdown.remainingTimeMs(); } else if (state.hasElectionTimeoutExpired(currentTimeMs)) { - if (quorum.isVoter()) { - transitionToCandidate(currentTimeMs); - } else { +// if (quorum.isVoter()) { + // canElectNewLeaderAfterOldLeaderPartitioned fails if we do not bump epoch since it is possible + // that the replica ends up as follower in the same epoch. + // resigned(leaderId=local) -> prospective(leaderId=local) -> follower(leaderId=local) which is illegal +// transitionToProspective(quorum.epoch() + 1, currentTimeMs); +// transitionToCandidate(currentTimeMs); +// } else { Review Comment: > the existing raft event simulation tests picked up on a new bug in pollResigned What is the exact error? Let's add an unittest to one of the `KafkaRaftClient*Test` suite that shows the bug. > attempt to become follower of itself in epoch 5. Let's add a check to `transtitionToFollower` that checks that `leaderId` is not equal to `localId`. It makes sense to me that after the resign state the replica should always increase its epoch. The replica resigned from leadership at epoch X so eventually the epoch will be at least X + 1. Did you consider transitioning to candidate and relaxing the transition functions to allow both resigned and prospective to transition to candidate? ########## raft/src/main/java/org/apache/kafka/raft/UnattachedState.java: ########## @@ -71,13 +73,7 @@ public UnattachedState( @Override public ElectionState election() { - if (votedKey.isPresent()) { - return ElectionState.withVotedCandidate(epoch, votedKey().get(), voters); - } else if (leaderId.isPresent()) { - return ElectionState.withElectedLeader(epoch, leaderId.getAsInt(), voters); - } else { - return ElectionState.withUnknownLeader(epoch, voters); - } + return new ElectionState(epoch, leaderId, votedKey, voters); Review Comment: Got it. Thanks. I see now that kraft doesn't check if both the leader and voted field are set during iniialization. During initialization, it does check the voted field first before checking the leader field. I think we should switch that order. If the leader and the leader endpoints are known, the replica should transition to follower immediately instead of needing to rediscover the leader. Let's change this comment too as it is slightly inaccurate. There are many reason why the replica may not send the leader endpoints. Using an old version for the RPC is not the only reason why the replica may send the leader id but not the leader endpoint: ```java private Optional<Boolean> maybeHandleCommonResponse( Errors error, OptionalInt leaderId, int epoch, Endpoints leaderEndpoints, Node source, long currentTimeMs ) { if (leaderEndpoints.isEmpty() && leaderId.isPresent()) { // The response didn't include the leader endpoints because it is from a replica // that doesn't support reconfiguration. Look up the leader endpoint in the // voter set. leaderEndpoints = partitionState .lastVoterSet() .listeners(leaderId.getAsInt()); } ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org