Jack Vanlightly wrote:

> - Regarding the removal of voters, when a leader appends a
> RemoveVoterRecord to its log, it immediately switches to the new
> configuration. There are two cases here:
>
> 1. The voter being removed is the leader itself. The KIP documents that
> the followers will continue to fetch from the leader despite these
> followers learning that the leader has been removed from the voter set.
> This is correct. It might be worth stating the reason: otherwise the
> cluster can get blocked from making further progress, because the
> followers must keep fetching from the leader or else the leader cannot
> commit the record and resign.

Yes. The KIP now reads: "To allow this operation to be committed and for the leader to resign, the followers will continue to fetch from the leader even if the leader is not part of the new voter set. In KRaft, leader election is triggered when the voter hasn't received a successful response within the fetch timeout."
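As an illustration of that behavior, here is a minimal sketch of the removed leader's bookkeeping, assuming Java 17 and hypothetical names (RemovedLeaderSketch, onHighWatermarkUpdate, resign) that are not the actual KRaft classes: the leader keeps serving Fetch requests after appending the record that removes it, and only resigns once the high watermark passes that record.

    import java.util.HashSet;
    import java.util.Set;

    // Sketch only: names are illustrative, not the actual KRaft implementation.
    final class RemovedLeaderSketch {
        private final int localId;
        private final Set<Integer> voters;      // latest (possibly uncommitted) voter set
        private long removeRecordOffset = -1L;  // offset of the record that removes this leader

        RemovedLeaderSketch(int localId, Set<Integer> initialVoters) {
            this.localId = localId;
            this.voters = new HashSet<>(initialVoters);
        }

        // The leader switches to the new configuration as soon as it appends the record.
        void onRemoveVoterRecordAppended(int removedVoterId, long offset) {
            voters.remove(removedVoterId);
            if (removedVoterId == localId) {
                removeRecordOffset = offset;
            }
        }

        // Followers keep fetching from this leader even though it is no longer a voter;
        // once the record is committed (covered by the high watermark) the leader resigns.
        void onHighWatermarkUpdate(long highWatermark) {
            if (removeRecordOffset >= 0 && highWatermark > removeRecordOffset) {
                resign();
            }
        }

        private void resign() {
            // e.g. send EndQuorumEpoch to the remaining voters and continue as an observer
        }
    }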
> 2. The voter being removed is not the leader. We should document that
> the leader will accept fetches with replica IDs (i.e. not observer
> fetches) from followers who are not in the voter set of the leader. This
> will occur because the voter being removed won't have received the
> RemoveVoterRecord yet and it must be allowed to reach this record so that:

In KRaft, an observer is any replica, with or without an (ID, UUID), that is not part of the voter set. I added a Key Terms section with definitions of the important terms used in the KRaft implementation and KIPs. I'll add more terms to that section as the discussion continues.

> - Regarding what happens when a voter leaves the voter set. When a
> non-leader voter reaches a RemoveVoterRecord where it is the subject of
> the removal, does it switch to being an observer and continue fetching?
> When a leader is removed, it carries on until it has committed the record
> or an election occurs electing a different leader. Until that point, it
> is leader but not a voter, so that makes it an observer? After it has
> committed the record and resigns (or gets deposed by an election) does it
> then start fetching as an observer?

Yeah. Good points. I was missing Handling sections for AddVoterRecord and RemoveVoterRecord. I added those sections and they go into this detail. I should point out that observer is not technically a state in KRaft. The better way to think about it is that the voter set determines which states, specifically the candidate state, a follower is allowed to transition to.

> - I think the Add/RemoveVoterRecords should also include the current
> voter set. This will make it easier to code against and also make
> troubleshooting easier. Else voter membership can only be reliably
> calculated by replaying the whole log.

Yeah. The reality of the implementation is that the replicas will have to read the entire snapshot and log before they can determine the voter set. I am concerned that adding this field would unnecessarily complicate the snapshotting logic, since it would have to remember which AddVoterRecords were already appended to the snapshot. I think we can make the topic partition snapshot and log more debuggable by improving the kafka-metadata-shell. It is not part of this KIP, but I hope to write a future KIP that describes how the kafka-metadata-shell displays information about the cluster metadata partition.
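To illustrate what "replaying the whole log" amounts to, here is a minimal sketch of deriving the voter set from the control records, assuming Java 17 and simplified stand-in record types (AddVoter, RemoveVoter) rather than the exact record schemas in the KIP:

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Sketch only: AddVoter/RemoveVoter are simplified stand-ins for the
    // AddVoterRecord and RemoveVoterRecord control record schemas.
    record AddVoter(int voterId) {}
    record RemoveVoter(int voterId) {}

    final class VoterSetReplay {
        // Derive the current voter set by replaying the latest snapshot followed
        // by the remainder of the log, applying each add/remove record in order.
        static Set<Integer> voterSet(List<Object> snapshotThenLogRecords) {
            Set<Integer> voters = new HashSet<>();
            for (Object record : snapshotThenLogRecords) {
                if (record instanceof AddVoter add) {
                    voters.add(add.voterId());
                } else if (record instanceof RemoveVoter remove) {
                    voters.remove(remove.voterId());
                }
            }
            return voters;
        }
    }

Nothing here is specific to the KIP's wire format; it is only meant to show why a replica must see the full snapshot plus the log suffix before it knows the current voter set.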
> - Regarding the adding of voters:
>
> 1. We should document the problem of adding a new voter which then
> causes all progress to be blocked until the voter catches up with the
> leader. For example, in a 3 node cluster, we lose 1 node. We add a new
> node, which means we have a majority = 3 with only 3 functioning nodes.
> Until the new node has caught up, the high watermark cannot advance. This
> can be solved by ensuring that to add a node we start it first as an
> observer and, once it has caught up, send the AddVoter RPC to the leader.
> This leads to the question of how an observer determines that it has
> caught up.

Yes. I have the following in the AddVoter Handling section:

--- Start of Section RPCs/AddVoter/Handling from KIP ---

When the leader receives an AddVoter request it will do the following:

1. Wait for the fetch offset of the replica (ID, UUID) to catch up to the log end offset of the leader.
2. Wait until there are no uncommitted add or remove voter records.
3. Append the AddVoterRecord to the log.
4. The KRaft internal listener will read this record from the log and add the voter to the voter set.
5. Wait for the AddVoterRecord to commit using the majority of the new configuration.
6. Send the AddVoter response to the client.

In 1., the leader needs to wait for the replica to catch up because when the AddVoterRecord is appended to the log the set of voters changes. If the added voter is far behind, it can take some time for it to reach the HWM. During this time the leader cannot commit data and the quorum will be unavailable from the perspective of the state machine. We mitigate this by waiting for the new replica to catch up before adding it to the set of voters.

In 4., the new replica will be part of the quorum, so the leader will start sending BeginQuorumEpoch requests to this replica. It is possible that the new replica has not yet replicated and applied this AddVoterRecord, so it doesn't know that it is a voter for this topic partition. The new replica will fail this RPC until it discovers that it is in the voter set. The leader will continue to retry until the RPC succeeds.

--- End of Section RPCs/AddVoter/Handling from KIP ---

The leader will track the log end offset of the to-be-added replica using the offset and epoch sent in the Fetch and FetchSnapshot requests. This is possible because the to-be-added replicas will send their replica ID and UUID in those requests. The leader will not write the AddVoterRecord to the log until it has determined that the to-be-added replica has caught up. I think that exactly when a replica counts as caught up is an implementation detail, and we can have that detailed discussion in Jira or the PR. What do you think?

> 2. Perhaps an administrator should send a Join RPC to the node we want
> to add. It will, if it isn't already, start fetching as an observer and
> automatically do the AddVoter RPC to the leader itself. This way, from a
> human operations point of view, there is only a single action to perform.

I am proposing that to-be-added replicas will immediately start replicating the log from the cluster metadata leader. This is how the leader discovers them and is able to return them in the DescribeQuorum RPC. At this point these replicas are observers, not voters. At any point the operator can use the AddVoter RPC, the Java Admin client, or "kafka-metadata-quorum --add-voter" to add voters to the cluster metadata topic partition. This RPC will get sent to the current leader, if there is one. The leader handles this request as described in the section I pasted above.

I like the idea of sending it directly to the leader and having the response held until the AddVoterRecord commits. This makes it much easier for an operator to programmatically add and remove voters, as it only needs to wait for a successful response from the cluster to know that the voter has been added or removed. It doesn't need to poll the cluster with DescribeQuorum to know when the voter has been added.
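As a summary of the handling steps quoted above, here is a rough sketch of the leader-side flow. All of the types and method names (AddVoterHandler, QuorumState, MetadataLog, awaitCaughtUp, and so on) are illustrative placeholders, not the actual KRaft implementation; the point is only the ordering of the catch-up, append, and commit steps.

    import java.util.concurrent.CompletableFuture;

    // Rough sketch of the leader-side AddVoter handling described above.
    // All types and methods here are illustrative placeholders.
    final class AddVoterHandler {
        interface ReplicaKey {}  // stands in for the replica (ID, UUID) pair

        interface QuorumState {
            CompletableFuture<Void> awaitCaughtUp(ReplicaKey replica, long leaderEndOffset);
            CompletableFuture<Void> awaitNoUncommittedVoterChanges();
            CompletableFuture<Void> awaitCommitted(long offset);  // majority of the new voter set
        }

        interface MetadataLog {
            long endOffset();
            long appendAddVoterRecord(ReplicaKey replica);  // returns the record's offset
        }

        private final QuorumState quorum;
        private final MetadataLog log;

        AddVoterHandler(QuorumState quorum, MetadataLog log) {
            this.quorum = quorum;
            this.log = log;
        }

        CompletableFuture<Void> handleAddVoter(ReplicaKey newVoter) {
            // 1. Wait for the new replica's fetch offset to reach the leader's log end offset.
            return quorum.awaitCaughtUp(newVoter, log.endOffset())
                // 2. Wait until there are no uncommitted add or remove voter records.
                .thenCompose(ignored -> quorum.awaitNoUncommittedVoterChanges())
                // 3. Append the AddVoterRecord; 4. the internal listener applies it to the voter set.
                .thenApply(ignored -> log.appendAddVoterRecord(newVoter))
                // 5. Wait for the record to commit under the majority of the new configuration.
                .thenCompose(offset -> quorum.awaitCommitted(offset))
                // 6. Only now complete the AddVoter response back to the client.
                .thenRun(() -> { /* send successful AddVoter response */ });
        }
    }

Holding the response until step 6 is what lets an operator treat a successful AddVoter response as confirmation that the new voter set is committed, without polling DescribeQuorum.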
> - For any of the community who understand or are interested in TLA+,
> should we include the provisional TLA+ specification for this work?
> https://github.com/Vanlightly/raft-tlaplus/blob/main/specifications/pull-raft/KRaftWithReconfig.tla

Yes, Jack. Thanks a lot for your work on the TLA+ specification for KRaft and this KIP. It was critical to getting this KIP done. I have also added a link to your TLA+ specification to the Test Plan section.