Re: [DISCUSS] KIP-853: KRaft Controller Membership Changes

Luke Chen Wed, 28 Feb 2024 04:34:12 -0800

> 2. After "RemoveVoter", what is the role of the node?
> It looks like after the voter got removed from the voter set, it is not a
> voter anymore. But I think it can still fetch with the leader. So it
should
> be an observer, with a "process.role=controller"? And if the node was
> originally "process.role=controller,broker", it'll become a broker-only
> node?


> Kafka nodes need to allow for controllers that are not voters. I don't
expect too many issues from an implementation point of view. Most of
it may just be aggressive validation in KafkaConfig. I think the
easier way to explain this state is that there will be controllers
that will never become active controllers. If we want, we can have a
monitor that turns on (1) if a node is in this state. What do you
think?

I agree we have a way for users to monitor the node state, like when does
the controller completed the voter removal ( so that it is safe to be
shutdown), or when does the controller completed the voter addition (so
that users can start to add another controller), etc.

10. controller.quorum.voters:
This is an existing configuration. This configuration describes the state
of the quorum and will only be used if the kraft.version feature is 0.
> From the discussion, it looks like even if the kraft.version is 1, we
still first check the `controller.quorum.voters` if
`controller.quorum.bootstrap.servers` is not set. Is that correct? If so,
maybe we need to update the description?

11. When a controller starts up before joining as a voter, it'll be an
observer. In this case, will it be shown in the observer field of
`kafka-metadata-quorum describe --status`? Same question to a controller
after getting removed.

12. What will happen if there's only 1 voter and user still tries to remove
the voter? Any error returned?

Thanks.
Luke



On Thu, Jan 25, 2024 at 7:50 AM José Armando García Sancio
<jsan...@confluent.io.invalid> wrote:

> Thanks for the feedback Luke. See my comments below:
>
> On Wed, Jan 24, 2024 at 4:20 AM Luke Chen <show...@gmail.com> wrote:
> > 1. About "VotersRecord":
> >
> > > When a KRaft voter becomes leader it will write a KRaftVersionRecord
> and
> > VotersRecord to the log if the log or the latest snapshot doesn't contain
> > any VotersRecord. This is done to make sure that the voter set in the
> > bootstrap snapshot gets replicated to all of the voters and to not rely
> on
> > all of the voters being configured with the same bootstrapped voter set.
> >
> > > This record will also be written to the log if it has never been
> written
> > to the log in the past. This semantic is nice to have to consistently
> > replicate the bootstrapping snapshot, at
> > 00000000000000000000-0000000000.checkpoint, of the leader to all of the
> > voters.
> >
> > If the `VotersRecord` has written into
> > 00000000000000000000-0000000000.checkpoint,
> > later, a new voter added. Will we write a new checkpoint to the file?
> > If so, does that mean the `metadata.log.max.snapshot.interval.ms` will
> be
> > ignored?
>
> KRaft (KafkaRaftClient) won't initiate the snapshot generation. The
> snapshot generation will be initiated by the state machine (controller
> or broker) using the RaftClient::createSnapshot method. When the state
> machine calls into RaftClient::createSnapshot the KafkaRaftClient will
> compute the set of voters at the provided offset and epoch, and write
> the VotersRecord after the SnapshotHeaderRecord. This does mean that
> the KafkaRaftClient needs to store in memory all of the voter set
> configurations between the RaftClient::latestSnapshotId and the LEO
> for the KRaft partition.
>
> > If not, then how could we make sure the voter set in the bootstrap
> snapshot
> > gets replicated to all of the voters and to not rely on all of the voters
> > being configured with the same bootstrapped voter set?
>
> I think my answer above should answer your question. VoterRecord-s
> will be in the log (log segments) and the snapshots so they will be
> replicated by Fetch and FetchSnapshot. When the voter set is changed
> or bootstrapped, the leader will write the VotersRecord to the log
> (active log segment). When the state machine (controller or broker)
> asks to create a snapshot, KRaft will write the VotersRecord at the
> start to the snapshot after the SnapshotHeaderRecord.
>
> > 2. After "RemoveVoter", what is the role of the node?
> > It looks like after the voter got removed from the voter set, it is not a
> > voter anymore. But I think it can still fetch with the leader. So it
> should
> > be an observer, with a "process.role=controller"? And if the node was
> > originally "process.role=controller,broker", it'll become a broker-only
> > node?
>
> Kafka nodes need to allow for controllers that are not voters. I don't
> expect too many issues from an implementation point of view. Most of
> it may just be aggressive validation in KafkaConfig. I think the
> easier way to explain this state is that there will be controllers
> that will never become active controllers. If we want, we can have a
> monitor that turns on (1) if a node is in this state. What do you
> think?
>
> > 3. "UpdateVoter" RPC V.S. "ControllerRegistration" RPC?
> > To me, the purpose of "UpdateVoter" is basically identical with
> > "ControllerRegistration", to register the voter info in the active
> > controller/metadata. Should we re-use the existing
> "ControllerRegistration"
> > one?
>
> Yeah. I thought about this option. There are two issues here.
> 1) The semantics are a bit different. I want the UpdateVoter RPC to
> only update the endpoints and kraft.version. I don't want the RPC to
> add the controller to the voter set. ControllerRegistration RPC is an
> upsert. It creates a registration if it doesn't exist or it updates a
> registration if one already exists.
>
> 2) These are two different layers. Metadata is the application layers.
> KRaft is the consensus (control) layer. To give a precise example, the
> controller is determining if it should send a
> ControllerRegistrationRequest based on the state of the cluster
> metadata (ControllerRegistrationRecord) while KRaft needs to send this
> information based on the state of the VotesRecord control record. It
> is beneficial to keep these layers separate even if they duplicate
> information as it leads to a better implementation that we can test,
> simulate and verify.
>
> > 4. > When the kraft.version is upgraded to a version greater 0 and the
> > version is supported by the voters, the leader will write a control
> record
> > batch that will include the kraft.version record and all of the AddVoter
> > records for all of the voters...
> >
> > When can't we just write a "VoterRecord" including all voters?
>
> Yes. I need to update the KIP to remove the AddVoterRecord and
> RemoveVoterRecord control records. Jason had a similar question and I
> provided this answer just to give some context on the current state:
>
> "Yes. We can remove them and I'll remove them. For context, I
> originally only had AddVoterRecord and RemoveVoterRecord. When
> fleshing out the details of the bootstrapping process, I realized that
> I needed a control record that would override the voter set and not do
> an incremental change. I needed this functionality because there is no
> guarantee that the content of the bootstrap checkpoint
> (...0000-...0000.checkpoint) matches for all of the voters.
>
> After I added the VotersRecord, my thinking for keeping AddVoterRecord
> and RemoveVoterRecord was to make it explicitly that the protocol only
> allows changing one voter at a time. I can instead write a comparison
> function that KRaft can use whenever it attempts to write or read a
> VotersRecord. The comparison function would fail if all of the
> possible majorities of the old voter set don't intersect with all of
> the possible majority of the new voter set."
>
> > 5. In "LeaderChangeMessage", You said:
> > > Add an optional VoterUuid to Voter. This change is not needed for
> > correctness but it is nice to have for tracing and debugging. The leader
> > will write version 0 if the kraft.version is 0. The leader will write
> > version 1 if the kraft.version is 1.
> >
> > And in the code change below:
> > +      { "name": "VoterUuid", "type": "int32", "versions": "1+" }
> > So, a. this VoterUuid field is a int32 field?
>
> No, it is not. It is a uuid. Let me fix that. Note that this value
> comes from the directory.id in meta.properties which was introduced by
> the JBOD KIP.
>
> > b. Where do we write "Kraft.version" to?
>
> The kraft.version is written in the KRaftVersionRecord control record
> This record is first written by the KRaft leader to the log (active
> segment). This record is also included in the snapshot and will be
> added by KRaft when the state machine (controller and broker) calls
> RaftClient::createSnapshot.
>
> If the control record doesn't exist (in the partition; log or
> snapshot), the kraft.version is 0. If it exists then the kraft.version
> is the latest version specified in the KRaftVersion field.
>
> > 6. In the "UpdateVoter", the handling section contains many `AddVoter`
> > names, which is confusing.
>
> Okay. Let me review that section and fix it.
>
> > 7. About this:
> > > This KIP requires the KRaft implementation now read uncommitted data
> from
> > log and snapshot to discover the voter set. This also means that the
> KRaft
> > implementation needs to handle this uncommitted state getting truncated
> and
> > reverted.
> >
> > I understand why this is required now, but I imagine it will make the
> > protocol more complicated. Could you elaborate more about how we're going
> > to handle the uncommitted data in the KIP?
>
> I don't think it should complicate the protocol. It will complicate
> the implementation. I suspect that my first implementation will just
> keep the voter sets in-memory from the latest snapshot to the LEO. If
> the KafkaRaftClient needs to truncate the log then it will also remove
> from heap memory the voter sets that were truncated.
>
> Note that the latest snapshot id is less than or equal to the HWM
> which is less than or equal to the LEO. All of these offsets are
> exclusive. In the rare case (data-loss) that the KafkaRaftClient
> truncates beyond the latest snapshot, then the raft client will
> reconstruct the entire voter sets state by reading the latest snapshot
> and the control records after that snapshot. When reading the latest
> snapshot the KRaft only needs to read the first few control records
> (SnapshotHeaderRecord, KRaftVersionRecord and VotersRecord).
>
> Thanks for the feedback Luke. It is greatly appreciated. Let me make
> all of the suggested changes and I'll ping the discussion thread when
> I am finished.
> --
> -José
>

Re: [DISCUSS] KIP-853: KRaft Controller Membership Changes

Reply via email to