> 2. After "RemoveVoter", what is the role of the node? > It looks like after the voter got removed from the voter set, it is not a > voter anymore. But I think it can still fetch with the leader. So it should > be an observer, with a "process.role=controller"? And if the node was > originally "process.role=controller,broker", it'll become a broker-only > node?
> Kafka nodes need to allow for controllers that are not voters. I don't expect too many issues from an implementation point of view. Most of it may just be aggressive validation in KafkaConfig. I think the easier way to explain this state is that there will be controllers that will never become active controllers. If we want, we can have a monitor that turns on (1) if a node is in this state. What do you think? I agree we have a way for users to monitor the node state, like when does the controller completed the voter removal ( so that it is safe to be shutdown), or when does the controller completed the voter addition (so that users can start to add another controller), etc. 10. controller.quorum.voters: This is an existing configuration. This configuration describes the state of the quorum and will only be used if the kraft.version feature is 0. > From the discussion, it looks like even if the kraft.version is 1, we still first check the `controller.quorum.voters` if `controller.quorum.bootstrap.servers` is not set. Is that correct? If so, maybe we need to update the description? 11. When a controller starts up before joining as a voter, it'll be an observer. In this case, will it be shown in the observer field of `kafka-metadata-quorum describe --status`? Same question to a controller after getting removed. 12. What will happen if there's only 1 voter and user still tries to remove the voter? Any error returned? Thanks. Luke On Thu, Jan 25, 2024 at 7:50 AM José Armando García Sancio <jsan...@confluent.io.invalid> wrote: > Thanks for the feedback Luke. See my comments below: > > On Wed, Jan 24, 2024 at 4:20 AM Luke Chen <show...@gmail.com> wrote: > > 1. About "VotersRecord": > > > > > When a KRaft voter becomes leader it will write a KRaftVersionRecord > and > > VotersRecord to the log if the log or the latest snapshot doesn't contain > > any VotersRecord. This is done to make sure that the voter set in the > > bootstrap snapshot gets replicated to all of the voters and to not rely > on > > all of the voters being configured with the same bootstrapped voter set. > > > > > This record will also be written to the log if it has never been > written > > to the log in the past. This semantic is nice to have to consistently > > replicate the bootstrapping snapshot, at > > 00000000000000000000-0000000000.checkpoint, of the leader to all of the > > voters. > > > > If the `VotersRecord` has written into > > 00000000000000000000-0000000000.checkpoint, > > later, a new voter added. Will we write a new checkpoint to the file? > > If so, does that mean the `metadata.log.max.snapshot.interval.ms` will > be > > ignored? > > KRaft (KafkaRaftClient) won't initiate the snapshot generation. The > snapshot generation will be initiated by the state machine (controller > or broker) using the RaftClient::createSnapshot method. When the state > machine calls into RaftClient::createSnapshot the KafkaRaftClient will > compute the set of voters at the provided offset and epoch, and write > the VotersRecord after the SnapshotHeaderRecord. This does mean that > the KafkaRaftClient needs to store in memory all of the voter set > configurations between the RaftClient::latestSnapshotId and the LEO > for the KRaft partition. > > > If not, then how could we make sure the voter set in the bootstrap > snapshot > > gets replicated to all of the voters and to not rely on all of the voters > > being configured with the same bootstrapped voter set? > > I think my answer above should answer your question. VoterRecord-s > will be in the log (log segments) and the snapshots so they will be > replicated by Fetch and FetchSnapshot. When the voter set is changed > or bootstrapped, the leader will write the VotersRecord to the log > (active log segment). When the state machine (controller or broker) > asks to create a snapshot, KRaft will write the VotersRecord at the > start to the snapshot after the SnapshotHeaderRecord. > > > 2. After "RemoveVoter", what is the role of the node? > > It looks like after the voter got removed from the voter set, it is not a > > voter anymore. But I think it can still fetch with the leader. So it > should > > be an observer, with a "process.role=controller"? And if the node was > > originally "process.role=controller,broker", it'll become a broker-only > > node? > > Kafka nodes need to allow for controllers that are not voters. I don't > expect too many issues from an implementation point of view. Most of > it may just be aggressive validation in KafkaConfig. I think the > easier way to explain this state is that there will be controllers > that will never become active controllers. If we want, we can have a > monitor that turns on (1) if a node is in this state. What do you > think? > > > 3. "UpdateVoter" RPC V.S. "ControllerRegistration" RPC? > > To me, the purpose of "UpdateVoter" is basically identical with > > "ControllerRegistration", to register the voter info in the active > > controller/metadata. Should we re-use the existing > "ControllerRegistration" > > one? > > Yeah. I thought about this option. There are two issues here. > 1) The semantics are a bit different. I want the UpdateVoter RPC to > only update the endpoints and kraft.version. I don't want the RPC to > add the controller to the voter set. ControllerRegistration RPC is an > upsert. It creates a registration if it doesn't exist or it updates a > registration if one already exists. > > 2) These are two different layers. Metadata is the application layers. > KRaft is the consensus (control) layer. To give a precise example, the > controller is determining if it should send a > ControllerRegistrationRequest based on the state of the cluster > metadata (ControllerRegistrationRecord) while KRaft needs to send this > information based on the state of the VotesRecord control record. It > is beneficial to keep these layers separate even if they duplicate > information as it leads to a better implementation that we can test, > simulate and verify. > > > 4. > When the kraft.version is upgraded to a version greater 0 and the > > version is supported by the voters, the leader will write a control > record > > batch that will include the kraft.version record and all of the AddVoter > > records for all of the voters... > > > > When can't we just write a "VoterRecord" including all voters? > > Yes. I need to update the KIP to remove the AddVoterRecord and > RemoveVoterRecord control records. Jason had a similar question and I > provided this answer just to give some context on the current state: > > "Yes. We can remove them and I'll remove them. For context, I > originally only had AddVoterRecord and RemoveVoterRecord. When > fleshing out the details of the bootstrapping process, I realized that > I needed a control record that would override the voter set and not do > an incremental change. I needed this functionality because there is no > guarantee that the content of the bootstrap checkpoint > (...0000-...0000.checkpoint) matches for all of the voters. > > After I added the VotersRecord, my thinking for keeping AddVoterRecord > and RemoveVoterRecord was to make it explicitly that the protocol only > allows changing one voter at a time. I can instead write a comparison > function that KRaft can use whenever it attempts to write or read a > VotersRecord. The comparison function would fail if all of the > possible majorities of the old voter set don't intersect with all of > the possible majority of the new voter set." > > > 5. In "LeaderChangeMessage", You said: > > > Add an optional VoterUuid to Voter. This change is not needed for > > correctness but it is nice to have for tracing and debugging. The leader > > will write version 0 if the kraft.version is 0. The leader will write > > version 1 if the kraft.version is 1. > > > > And in the code change below: > > + { "name": "VoterUuid", "type": "int32", "versions": "1+" } > > So, a. this VoterUuid field is a int32 field? > > No, it is not. It is a uuid. Let me fix that. Note that this value > comes from the directory.id in meta.properties which was introduced by > the JBOD KIP. > > > b. Where do we write "Kraft.version" to? > > The kraft.version is written in the KRaftVersionRecord control record > This record is first written by the KRaft leader to the log (active > segment). This record is also included in the snapshot and will be > added by KRaft when the state machine (controller and broker) calls > RaftClient::createSnapshot. > > If the control record doesn't exist (in the partition; log or > snapshot), the kraft.version is 0. If it exists then the kraft.version > is the latest version specified in the KRaftVersion field. > > > 6. In the "UpdateVoter", the handling section contains many `AddVoter` > > names, which is confusing. > > Okay. Let me review that section and fix it. > > > 7. About this: > > > This KIP requires the KRaft implementation now read uncommitted data > from > > log and snapshot to discover the voter set. This also means that the > KRaft > > implementation needs to handle this uncommitted state getting truncated > and > > reverted. > > > > I understand why this is required now, but I imagine it will make the > > protocol more complicated. Could you elaborate more about how we're going > > to handle the uncommitted data in the KIP? > > I don't think it should complicate the protocol. It will complicate > the implementation. I suspect that my first implementation will just > keep the voter sets in-memory from the latest snapshot to the LEO. If > the KafkaRaftClient needs to truncate the log then it will also remove > from heap memory the voter sets that were truncated. > > Note that the latest snapshot id is less than or equal to the HWM > which is less than or equal to the LEO. All of these offsets are > exclusive. In the rare case (data-loss) that the KafkaRaftClient > truncates beyond the latest snapshot, then the raft client will > reconstruct the entire voter sets state by reading the latest snapshot > and the control records after that snapshot. When reading the latest > snapshot the KRaft only needs to read the first few control records > (SnapshotHeaderRecord, KRaftVersionRecord and VotersRecord). > > Thanks for the feedback Luke. It is greatly appreciated. Let me make > all of the suggested changes and I'll ping the discussion thread when > I am finished. > -- > -José >