jsancio commented on code in PR #18240:
URL: https://github.com/apache/kafka/pull/18240#discussion_r1899579689


##########
raft/src/main/java/org/apache/kafka/raft/QuorumState.java:
##########
@@ -39,40 +39,47 @@
  * how they are triggered:
  *
  * Resigned transitions to:
- *    Unattached: After learning of a new election with a higher epoch
- *    Candidate: After expiration of the election timeout
- *    Follower: After discovering a leader with an equal or larger epoch
+ *    Unattached:  After learning of a new election with a higher epoch
+ *    Prospective: After expiration of the election timeout
+ *    Follower:    After discovering a leader with an equal or larger epoch
  *
  * Unattached transitions to:
- *    Unattached: After learning of a new election with a higher epoch or 
after giving a binding vote
- *    Candidate: After expiration of the election timeout
- *    Follower: After discovering a leader with an equal or larger epoch
+ *    Unattached:  After learning of a new election with a higher epoch or 
after giving a binding vote
+ *    Prospective: After expiration of the election timeout
+ *    Follower:    After discovering a leader with an equal or larger epoch
+ *
+ * Prospective transitions to:
+ *    Unattached:  After learning of an election with a higher epoch, or node 
did not have last
+ *                 known leader and loses/times out election
+ *    Candidate:   After receiving a majority of PreVotes granted
+ *    Follower:    After discovering a leader with a larger epoch, or node had 
a last known leader
+ *                 and loses/times out election
  *
  * Candidate transitions to:
- *    Unattached: After learning of a new election with a higher epoch
- *    Candidate: After expiration of the election timeout
- *    Leader: After receiving a majority of votes
+ *    Unattached:  After learning of a new election with a higher epoch
+ *    Prospective: After expiration of the election timeout or loss of election
+ *    Leader:      After receiving a majority of votes
  *
  * Leader transitions to:
- *    Unattached: After learning of a new election with a higher epoch
- *    Resigned: When shutting down gracefully
+ *    Unattached:  After learning of a new election with a higher epoch
+ *    Resigned:    When shutting down gracefully
  *
  * Follower transitions to:
- *    Unattached: After learning of a new election with a higher epoch
- *    Candidate: After expiration of the fetch timeout
- *    Follower: After discovering a leader with a larger epoch
+ *    Unattached:  After learning of a new election with a higher epoch
+ *    Prospective: After expiration of the fetch timeout
+ *    Follower:    After discovering a leader with a larger epoch
  *
- * Observers follow a simpler state machine. The 
Voted/Candidate/Leader/Resigned
+ * Observers follow a simpler state machine. The 
Prospective/Candidate/Leader/Resigned

Review Comment:
   Yes. This is correct. Observers can vote for candidate (KIP-853) and 
prospective (KIP-996). This was changed as part of KIP-853 as documented 
[here](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=217391519#KIP853:KRaftControllerMembershipChanges-Leaderelection).



##########
raft/src/main/java/org/apache/kafka/raft/KafkaRaftClient.java:
##########
@@ -2935,14 +3014,18 @@ private long pollResigned(long currentTimeMs) {
             // until either the shutdown expires or an election bumps the epoch
             stateTimeoutMs = shutdown.remainingTimeMs();
         } else if (state.hasElectionTimeoutExpired(currentTimeMs)) {
-            if (quorum.isVoter()) {
-                transitionToCandidate(currentTimeMs);
-            } else {
+//            if (quorum.isVoter()) {
+                // canElectNewLeaderAfterOldLeaderPartitioned fails if we do 
not bump epoch since it is possible
+                // that the replica ends up as follower in the same epoch.
+                // resigned(leaderId=local) -> prospective(leaderId=local) -> 
follower(leaderId=local) which is illegal
+//                transitionToProspective(quorum.epoch() + 1, currentTimeMs);
+//                transitionToCandidate(currentTimeMs);
+//            } else {

Review Comment:
   > the existing raft event simulation tests picked up on a new bug in 
pollResigned
   What is the exact error? Let's add an unittest to one of the 
`KafkaRaftClient*Test` suite that shows the bug.
   
   > attempt to become follower of itself in epoch 5.
   Let's add a check to `transtitionToFollower` that checks that `leaderId` is 
not equal to `localId`.
   
   It makes sense to me that after the resign state the replica should always 
increase its epoch. The replica resigned from leadership at epoch X so 
eventually the epoch will be at least X + 1. Did you consider transitioning to 
candidate and relaxing the transition functions to allow both resigned and 
prospective to transition to candidate?



##########
raft/src/main/java/org/apache/kafka/raft/UnattachedState.java:
##########
@@ -71,13 +73,7 @@ public UnattachedState(
 
     @Override
     public ElectionState election() {
-        if (votedKey.isPresent()) {
-            return ElectionState.withVotedCandidate(epoch, votedKey().get(), 
voters);
-        } else if (leaderId.isPresent()) {
-            return ElectionState.withElectedLeader(epoch, leaderId.getAsInt(), 
voters);
-        } else {
-            return ElectionState.withUnknownLeader(epoch, voters);
-        }
+        return new ElectionState(epoch, leaderId, votedKey, voters);

Review Comment:
   Got it. Thanks. I see now that kraft doesn't check if both the leader and 
voted field are set during iniialization.
   
   During initialization, it does check the voted field first before checking 
the leader field. I think we should switch that order. If the leader and the 
leader endpoints are known, the replica should transition to follower 
immediately instead of needing to rediscover the leader.
   
   Let's change this comment too as it is slightly inaccurate. There are many 
reason why the replica may not send the leader endpoints. Using an old version 
for the RPC is not the only reason why the replica may send the leader id but 
not the leader endpoint:
   ```java
         private Optional<Boolean> maybeHandleCommonResponse(
             Errors error,
             OptionalInt leaderId,
             int epoch,
             Endpoints leaderEndpoints,
             Node source,
             long currentTimeMs
         ) {
             if (leaderEndpoints.isEmpty() && leaderId.isPresent()) {
                 // The response didn't include the leader endpoints because it 
is from a replica
                 // that doesn't support reconfiguration. Look up the leader 
endpoint in the
                 // voter set.
                 leaderEndpoints = partitionState
                     .lastVoterSet()
                     .listeners(leaderId.getAsInt());
             }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to