Hi José, Thanks for the great KIP. It took me quite a lot of time reading and digesting!
Some questions and comments: 1. About "VotersRecord": > When a KRaft voter becomes leader it will write a KRaftVersionRecord and VotersRecord to the log if the log or the latest snapshot doesn't contain any VotersRecord. This is done to make sure that the voter set in the bootstrap snapshot gets replicated to all of the voters and to not rely on all of the voters being configured with the same bootstrapped voter set. > This record will also be written to the log if it has never been written to the log in the past. This semantic is nice to have to consistently replicate the bootstrapping snapshot, at 00000000000000000000-0000000000.checkpoint, of the leader to all of the voters. If the `VotersRecord` has written into 00000000000000000000-0000000000.checkpoint, later, a new voter added. Will we write a new checkpoint to the file? If so, does that mean the `metadata.log.max.snapshot.interval.ms` will be ignored? If not, then how could we make sure the voter set in the bootstrap snapshot gets replicated to all of the voters and to not rely on all of the voters being configured with the same bootstrapped voter set? 2. After "RemoveVoter", what is the role of the node? It looks like after the voter got removed from the voter set, it is not a voter anymore. But I think it can still fetch with the leader. So it should be an observer, with a "process.role=controller"? And if the node was originally "process.role=controller,broker", it'll become a broker-only node? 3. "UpdateVoter" RPC V.S. "ControllerRegistration" RPC? To me, the purpose of "UpdateVoter" is basically identical with "ControllerRegistration", to register the voter info in the active controller/metadata. Should we re-use the existing "ControllerRegistration" one? 4. > When the kraft.version is upgraded to a version greater 0 and the version is supported by the voters, the leader will write a control record batch that will include the kraft.version record and all of the AddVoter records for all of the voters... When can't we just write a "VoterRecord" including all voters? 5. In "LeaderChangeMessage", You said: > Add an optional VoterUuid to Voter. This change is not needed for correctness but it is nice to have for tracing and debugging. The leader will write version 0 if the kraft.version is 0. The leader will write version 1 if the kraft.version is 1. And in the code change below: + { "name": "VoterUuid", "type": "int32", "versions": "1+" } So, a. this VoterUuid field is a int32 field? b. Where do we write "Kraft.version" to? 6. In the "UpdateVoter", the handling section contains many `AddVoter` names, which is confusing. 7. About this: > This KIP requires the KRaft implementation now read uncommitted data from log and snapshot to discover the voter set. This also means that the KRaft implementation needs to handle this uncommitted state getting truncated and reverted. I understand why this is required now, but I imagine it will make the protocol more complicated. Could you elaborate more about how we're going to handle the uncommitted data in the KIP? Thank you. Luke On Tue, Jan 23, 2024 at 9:00 AM José Armando García Sancio <jsan...@confluent.io.invalid> wrote: > Thanks Jason. Comments below. > > On Wed, Jan 10, 2024 at 9:06 AM Jason Gustafson > <ja...@confluent.io.invalid> wrote: > > One additional thought. It would be helpful to have an example to justify > > the need for this: > > > > > Wait for the fetch offset of the replica (ID, UUID) to catch up to the > > log end offset of the leader. > > > > It is helpful also to explain how this affects the AddVoter RPC. Do we > wait > > indefinitely? Or do we give up and return a timeout error if the new > voter > > cannot catch up? Probably the latter makes the most sense. > > Yes. I will have more details here. Jason and I discussed this offline > but waiting for the new replica to catch (to the LEO) is a heuristic > that would minimize the amount of time where the leader cannot > increase the HWM because the new replica is needed to form the > majority. A example that shows this is: > > Current Voter Set: > A: offset = 100 > B: offset = 100 > C: offset = 0 > > In this configuration the leader can continue to advance the HWM since > the majority A, B is at the HWM/LEO. > > If the user now adds a voter to the voter set: > A: offset = 100 > B: offset = 100 > C: offset = 0 > D: offset = 0 > > The leader cannot advance the HWM until either C or D catches up to > the HWM because the majority has to include one of either C, D or > both. > > Thanks, > -- > -José >