Hey Jose,

A few more questions:
1. When adding a voter, the KIP proposes to return a timeout error if the
voter cannot catch up in time. It might be more useful to return a more
specific error so that an operator can understand why the timeout occurred.
Alternatively, perhaps we could keep the generic error but add an
ErrorMessage field in the response to explain the issue. (At the same time,
perhaps we can add ErrorMessage to the other new RPCs. I think this is
becoming more or less standard in Kafka RPCs.)

2. Say that the voters are A, B, and C, and that C is offline. What error
would be returned from RemoveVoters if we try to remove B before C is back
online?

3. There is a note about a voter accepting a Vote request from a voter that
is not in their voter set. I guess the same comment applies to
BeginQuorumEpoch? In other words, a voter should accept a new leader even
if it is not in its local voter set. The note in the reference says: "Thus,
servers process incoming RPC requests without consulting their current
configurations." On the other hand, when it comes to election, voters will
become candidates based on the latest voter set from the log (regardless of
whether it is committed), and they will seek out votes only from the same
voter set. Is that about right?

Thanks,
Jason

On Fri, Feb 2, 2024 at 10:55 AM Jun Rao <j...@confluent.io.invalid> wrote:

> Hi, Jose,
>
> Thanks for the KIP. A few comments below.
>
> 10. kraft.version: Functionality wise, this seems very similar to
> metadata.version, which is to make sure that all brokers/controllers are on
> a supported version before enabling a new feature. Could you explain why we
> need a new one instead of just relying on metadata.version?
>
> 11. Both the quorum-state file and controller.quorum.bootstrap.servers
> contain endpoints. Which one takes precedence?
>
> 12. It seems that downgrading from this KIP is not supported? Could we have
> a section to make it explicit?
>
> 13.
> controller.quorum.auto.join.enable: If this is set to true, when does the
> controller issue the AddVoter RPC? Does it need to wait until it's caught
> up? Does it issue the AddVoter RPC on every restart?
>
> 14. "using the AddVoter RPC, the Admin client or the kafka-metadata-quorum
> CLI.": In general, the operator doesn't know how to make RPC calls. So the
> practical options are either the CLI or the admin client.
>
> 15. VotersRecord: Why does it need to include name and SecurityProtocol in
> EndPoints? It's meant to replace controller.quorum.voters, which only
> includes host/port.
>
> 16. "The KRaft leader cannot do this for observers (brokers) since their
> supported versions are not directly known by the KRaft leader."
> Hmm, the KRaft leader receives BrokerRegistrationRequest that includes
> supported feature versions, right?
>
> 17. UpdateVoter:
> 17.1 "The leader will learn the range of supported version from the
> UpdateVoter RPC".
> KIP-919 introduced ControllerRegistrationRequest to do that. Do we need a
> new one?
> 17.2 Do we support changing the voter's endpoint dynamically? If not, it
> seems that can be part of ControllerRegistrationRequest too.
>
> 18. AddVoter
> 18.1 "This RPC can be sent to a broker or controller, when sent to a
> broker, the broker will forward the request to the active controller."
> If it's sent to a non-active controller, it will also be forwarded to the
> active controller, right?
> 18.2 Why do we need the name/security protocol fields in the request?
> Currently, they can be derived from the configs.
> { "name": "Listeners", "type": "[]Listener", "versions": "0+",
>   "about": "The endpoints that can be used to communicate with the voter",
>   "fields": [
>   { "name": "Name", "type": "string", "versions": "0+", "mapKey": true,
>     "about": "The name of the endpoint" },
>   { "name": "Host", "type": "string", "versions": "0+",
>     "about": "The hostname" },
>   { "name": "Port", "type": "uint16", "versions": "0+",
>     "about": "The port" },
>   { "name": "SecurityProtocol", "type": "int16", "versions": "0+",
>     "about": "The security protocol" }
> ]}
> 18.3 "4. Send an ApiVersions RPC to the first listener to discover the
> supported kraft.version of the new voter."
> Hmm, I thought that we found using ApiVersions unreliable (
> https://issues.apache.org/jira/browse/KAFKA-15230) and therefore
> introduced ControllerRegistrationRequest to propagate this information.
> ControllerRegistrationRequest can be made at step 1 during catchup.
> 18.4 "In 4., the new replica will be part of the quorum so the leader will
> start sending BeginQuorumEpoch requests to this replica."
> Hmm, the leader should have sent BeginQuorumEpoch at step 1 so that the new
> replica can catch up from it, right? Also, step 4 above only mentions
> ApiVersions RPC, not BeginQuorumEpoch.
>
> 19. Vote: It's kind of weird that VoterUuid is at the partition level.
> VoterId and VoterUuid uniquely identify a node, right? So it seems that it
> should be at the same level as VoterId?
>
> 20. EndQuorumEpoch: "PreferredSuccessor which is an array is replica ids,
> will be replaced by PreferredCandidates"
> 20.1 But PreferredSuccessor is still listed in the response.
> 20.2 Why does the response need to include endPoint?
>
> 21. "The KRaft implementation and protocol describe in KIP-595 and KIP-630
> never read from the log or snapshot."
> This seems confusing. The KIP-595 protocol does read the log to replicate
> it, right?
>
> 22.
> "create a snapshot at 00000000000000000000-0000000000.checkpoint with
> the necessary control records (KRaftVersionRecord and VotersRecord) to make
> this Kafka node the only voter for the quorum."
> 22.1 It seems that currently, the bootstrap checkpoint is
> named bootstrap.checkpoint. Are we changing it to
> 00000000000000000000-0000000000.checkpoint?
> 22.2 Just to be clear, records in the bootstrap snapshot will be propagated
> to QuorumStateData, right?
>
> 23. kafka-metadata-quorum: In the output, we include the endpoints for the
> voters, but not the observers. Why the inconsistency?
>
> 24. "The tool will set the TimoutMs for the AddVoterRequest to 30 seconds."
> Earlier, we have "The TimoutMs for the AddVoter RPC will be set to 10
> minutes". Should we make them consistent?
>
> 25. "The replicas will update these order list of voters set whenever the
> latest snapshot id increases, a VotersRecord control record is read from
> the log and the log is truncated."
> Why will the voters set change when a new snapshot is created?
>
> 26. VotersRecord includes MinSupportedVersion and MaxSupportedVersion. In
> FeatureLevelRecord, we only have one finalized version. Should we be
> consistent?
>
> 27. The Raft dissertation mentioned the issue of disruptive servers after
> they are removed. "Once the cluster leader has created the Cnew entry, a
> server that is not in Cnew will no longer receive heartbeats, so it will
> time out and start new elections."
> Have we addressed this issue?
>
> 28. There are quite a few typos in the KIP.
> 28.1 "the voters are the replica ID and UUID is in its own voters set
> Does not read well.
> 28.2 "used to configured:
> used to configure
> 28.3 "When at voter becomes a leader"
> when a voter
> 28.4 "write an VotersRecord controler"
> a VotersRecord; controller
> 28.5 "will get bootstrap"
> bootstrapped
> 28.6 "the leader of the KRaft cluster metadata leader"
> the leader of the KRaft cluster metadata partition
> 28.7 "until the call as been acknowledge"
> has been acknowledged
> 28.8 "As describe in KIP-595"
> described
> 28.9 "The KRaft implementation and protocol describe"
> described
> 28.10 "In other, the KRaft topic partition won't"
> In other words
> 28.11 "greater than their epic"
> epoch
> 28.12 "so their state will be tracking using their ID and UUID"
> tracked
>
> Thanks,
>
> Jun
>
> On Thu, Feb 1, 2024 at 7:27 AM Jack Vanlightly <vanligh...@apache.org>
> wrote:
>
> > Hi Jose,
> >
> > I have a question about how voters and observers, which are far behind the
> > leader, catch up when there are multiple reconfiguration commands in the
> > log between their position and the end of the log.
> >
> > Here are some example situations that need clarification:
> >
> > Example 1
> > Imagine a cluster of three voters: r1, r2, r3. Voter r3 goes offline for a
> > while. In the meantime, r1 dies and gets replaced with r4, and r2 dies
> > getting replaced with r5. Now the cluster is formed of r3, r4, r5. When r3
> > comes back online, it tries to fetch from dead nodes and finally starts
> > unending leader elections - stuck because it doesn't realise it's in a
> > stale configuration whose members are all dead except for itself.
> >
> > Example 2
> > Imagine a cluster of three voters: r1, r2, r3. Voter r3 goes offline then
> > comes back and discovers the leader is r1. Again, there are many
> > reconfiguration commands between its LEO and the end of the leader's log.
> > It starts fetching, changing configurations as it goes until it reaches a
> > stale configuration (r3, r4, r5) where it is a member but none of its peers
> > are actually alive anymore. It continues to fetch from r1, but then for
> > some reason the connection to r1 is interrupted. r3 starts leader elections
> > which don't get responses.
> >
> > Example 3
> > Imagine a cluster of three voters: r1, r2, r3. Over time, many
> > reconfigurations have happened and now the voters are (r4, r5, r6). The
> > observer o1 starts fetching from the nodes in
> > 'controller.quorum.bootstrap.servers' which includes r4. r4 responds with a
> > NotLeader and that r5 is the leader. o1 starts fetching and goes through
> > the motions of switching to each configuration as it learns of it in the
> > log. The connection to r5 gets interrupted while it is in the configuration
> > (r7, r8, r9). It attempts to fetch from these voters but none respond as
> > they are all long dead, as this is a stale configuration. Does the observer
> > fall back to 'controller.quorum.bootstrap.servers' for its list of voters
> > it can fetch from?
> >
> > After thinking it through, it occurs to me that in examples 1 and 2, the
> > leader (of the latest configuration) should be sending BeginQuorumEpoch
> > requests to r3 after a certain timeout? r3 can start elections (based on
> > its stale configuration) which will go nowhere, until it eventually
> > receives a BeginQuorumEpoch from the leader and it will learn of the
> > leader and resume fetching.
> >
> > In the case of an observer, I suppose it must fall back to
> > 'controller.quorum.voters' or 'controller.quorum.bootstrap.servers' to
> > learn of the leader?
> >
> > Thanks
> > Jack
> >
> > On Fri, Jan 26, 2024 at 1:36 AM José Armando García Sancio
> > <jsan...@confluent.io.invalid> wrote:
> >
> > > Hi all,
> > >
> > > I have updated the KIP to include information on how KRaft controller
> > > automatic joining will work.
> > >
> > > Thanks,
> > > --
> > > -José
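The voter-set rule that comes up throughout this thread (items 3 and 25 above, and Jack's examples) can be sketched roughly as follows: a replica acts on the most recent VotersRecord in its snapshot or log, even before it is committed; it reverts to the previous set if the log is truncated past that record; and while it accepts Vote and BeginQuorumEpoch requests regardless of membership, it only becomes a candidate if it is in its latest set. This is an illustrative Python sketch with invented names (`VoterSetHistory`, `may_become_candidate`), not the actual Kafka implementation:

```python
class VoterSetHistory:
    """Tracks the voter sets a replica has seen, oldest first."""

    def __init__(self, snapshot_offset, snapshot_voters):
        # The snapshot supplies the base entry; its offset is always
        # below any offset the log could be truncated to.
        self.history = [(snapshot_offset, frozenset(snapshot_voters))]

    def append(self, offset, voters):
        # A VotersRecord takes effect as soon as it is read from the
        # log, without waiting for it to be committed.
        self.history.append((offset, frozenset(voters)))

    def truncate_to(self, offset):
        # Log truncation discards voter sets written at or after the
        # truncation offset, reverting to the previous configuration.
        self.history = [(o, v) for (o, v) in self.history if o < offset]

    def latest(self):
        # The voter set used for elections (possibly uncommitted).
        return self.history[-1][1]


def may_become_candidate(replica_id, history):
    # A replica accepts Vote/BeginQuorumEpoch from any sender, but only
    # campaigns (and solicits votes) if it is in its latest voter set.
    return replica_id in history.latest()
```

For example, a replica that read an uncommitted VotersRecord adding r4 would let r4 campaign, but would stop doing so if a leader change truncated that record away.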