Hi Jose,

Thanks for getting this started. Some comments:

- Regarding the removal of voters, when a leader appends a
RemoveVoterRecord to its log, it immediately switches to the new
configuration. There are two cases here:
    1. The voter being removed is the leader itself. The KIP documents that
the followers will continue to fetch from the leader even after they learn
that the leader has been removed from the voter set. This is correct. It
might be worth stating the reason: if the followers stopped fetching, the
leader could never commit the record and resign, and the cluster would be
blocked from making further progress.
    2. The voter being removed is not the leader. We should document that
the leader will accept fetches with replica IDs (i.e. not observer fetches)
from followers that are not in the leader's voter set. This will occur
because the voter being removed won't have received the RemoveVoterRecord
yet, and it must be allowed to reach this record so that: a) the
RemoveVoterRecord can be committed (this voter may actually be required in
order to reach quorum on this record) and b) the voter being removed can
switch out of voter mode. A rough sketch of both cases follows below.
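
To make the two cases concrete, here is a minimal Java sketch of how a
leader might behave. The class and method names (VoterRemovalSketch,
acceptReplicaFetch, maybeResign) and the resignation hook are my own
invention for illustration, not part of the KIP:

import java.util.Set;

public class VoterRemovalSketch {
    private final int localId;
    private final Set<Integer> latestVoterSet; // includes uncommitted changes
    private long pendingRemoveVoterOffset = -1L;

    VoterRemovalSketch(int localId, Set<Integer> voters) {
        this.localId = localId;
        this.latestVoterSet = voters;
    }

    // Case 2: keep serving replica fetches from a follower that has already
    // been removed from the voter set, so it can help commit the
    // RemoveVoterRecord and eventually read it and stop acting as a voter.
    boolean acceptReplicaFetch(int replicaId) {
        // never reject a replica fetch just because replicaId is not a voter
        return true;
    }

    // Case 1: a leader that removed itself stays leader until its own
    // RemoveVoterRecord is committed; only then does it resign. If it
    // resigned earlier, the record could never be committed and the cluster
    // would stall.
    void maybeResign(long highWatermark) {
        boolean removedSelf = !latestVoterSet.contains(localId);
        if (removedSelf
                && pendingRemoveVoterOffset >= 0
                && highWatermark > pendingRemoveVoterOffset) {
            resign();
        }
    }

    private void resign() {
        // hand off leadership, e.g. by notifying the remaining voters
        System.out.println("resigning after committing own removal");
    }

    public static void main(String[] args) {
        // leader 1 has removed itself; remaining voters are 2 and 3
        VoterRemovalSketch leader = new VoterRemovalSketch(1, Set.of(2, 3));
        leader.pendingRemoveVoterOffset = 100L;
        System.out.println(leader.acceptReplicaFetch(1)); // true: still served
        leader.maybeResign(101L);                         // resigns here
    }
}
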
- Regarding what happens when a voter leaves the voter set. When a
non-leader voter reaches a RemoveVoterRecord where it is the subject of the
removal, does it switch to being an observer and continue fetching? (One
possible handling of this case is sketched below.) When a leader is
removed, it carries on until it has committed the record or an election
elects a different leader. Until that point, it is leader but not a voter,
so does that make it an observer? After it has committed the record and
resigned (or been deposed by an election), does it then start fetching as
an observer?
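
For the non-leader question above, one possible handling (my assumption,
not something the KIP specifies) is that the replica simply flips a role
flag to observer when it reads a RemoveVoterRecord naming itself and keeps
fetching unchanged. The types below are invented for illustration:

public class RemovedVoterSketch {
    enum ReplicaRole { VOTER, OBSERVER }

    static class RemoveVoterRecord {
        final int removedReplicaId;
        RemoveVoterRecord(int id) { this.removedReplicaId = id; }
    }

    private final int localId;
    private ReplicaRole role = ReplicaRole.VOTER;

    RemovedVoterSketch(int localId) { this.localId = localId; }

    // Called while replaying records fetched from the leader.
    void handle(RemoveVoterRecord record) {
        if (record.removedReplicaId == localId) {
            role = ReplicaRole.OBSERVER; // stop voting or standing for election
            // fetching continues unchanged; only the role flag changes
        }
    }

    public static void main(String[] args) {
        RemovedVoterSketch replica = new RemovedVoterSketch(3);
        replica.handle(new RemoveVoterRecord(3));
        System.out.println(replica.role); // OBSERVER
    }
}
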
- I think the Add/RemoveVoterRecords should also include the current voter
set (illustrated in the sketch below). This will make the records easier to
code against and also make troubleshooting easier. Otherwise, voter
membership can only be reliably determined by replaying the whole log.
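
As an illustration of what I mean, a removal record could carry the
post-change voter set as an extra field, so the latest control record alone
describes membership. The field names below are invented, not the KIP's
schema:

import java.util.List;

public final class RemoveVoterRecordSketch {
    public final int removedReplicaId;
    public final String removedReplicaUuid;
    // proposed extra field: the full voter set after applying this change
    public final List<Integer> voterSetAfterChange;

    public RemoveVoterRecordSketch(int removedReplicaId,
                                   String removedReplicaUuid,
                                   List<Integer> voterSetAfterChange) {
        this.removedReplicaId = removedReplicaId;
        this.removedReplicaUuid = removedReplicaUuid;
        this.voterSetAfterChange = List.copyOf(voterSetAfterChange);
    }
}
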
- Regarding the adding of voters:
    1. We should document the problem of adding a new voter which then
causes all progress to be blocked until the voter catches up with the
leader. For example, in a 3-node cluster we lose 1 node, leaving 2
functioning voters. We then add a new node, which gives us 4 voters and a
majority of 3, with only 3 functioning nodes. Until the new node has caught
up, the high watermark cannot advance. This can be solved by ensuring that
to add a node we start it first as an observer and, once it has caught up,
send the AddVoter RPC to the leader. This leads to the question of how an
observer determines that it has caught up (see the sketch after point 2
below).
    2. Perhaps an administrator should send a Join RPC to the node we want
to add. The node will, if it isn't already, start fetching as an observer
and automatically send the AddVoter RPC to the leader itself. This way,
from a human operations point of view, there is only a single action to
perform.
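
Here is a rough sketch of points 1 and 2, assuming "caught up" is defined
by a lag threshold against the leader's log end offset. The threshold
value, the Join flow and the method names are all hypothetical:

public class JoinAsVoterSketch {
    private static final long MAX_CATCH_UP_LAG = 1000L; // assumed threshold

    // Point 1: treat the observer as caught up once its log end offset is
    // within a small lag of the leader's, so adding it as a voter does not
    // stall the high watermark while it copies the bulk of the log.
    static boolean caughtUp(long leaderLogEndOffset, long localLogEndOffset) {
        return leaderLogEndOffset - localLogEndOffset <= MAX_CATCH_UP_LAG;
    }

    // Point 2: after the administrator sends a Join request to this node,
    // the node itself sends AddVoter to the leader once it is caught up, so
    // the operator only performs a single action.
    void onJoinRequested(long leaderLogEndOffset, long localLogEndOffset) {
        if (caughtUp(leaderLogEndOffset, localLogEndOffset)) {
            sendAddVoterToLeader();
        }
        // otherwise keep fetching as an observer and re-check on the next fetch
    }

    private void sendAddVoterToLeader() {
        System.out.println("sending AddVoter RPC with this replica's id and uuid");
    }

    public static void main(String[] args) {
        JoinAsVoterSketch node = new JoinAsVoterSketch();
        node.onJoinRequested(5000L, 1200L); // still lagging: keeps observing
        node.onJoinRequested(5000L, 4800L); // caught up: sends AddVoter
    }
}
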
- For anyone in the community who understands or is interested in TLA+,
should we include the provisional TLA+ specification for this work?
https://github.com/Vanlightly/raft-tlaplus/blob/main/specifications/pull-raft/KRaftWithReconfig.tla

Thanks
Jack


On Thu, Jul 21, 2022 at 7:38 PM Justine Olshan <jols...@confluent.io.invalid>
wrote:

> Hey Jose -- this seems like an important KIP! And I enjoy seeing more Uuid
> usage :)
> I was curious a bit more about the motivation here. That section seems to
> be missing.
>
> Thanks for sharing the KIP!
> Justine
>
> On Thu, Jul 21, 2022 at 10:30 AM Niket Goel <ng...@confluent.io.invalid>
> wrote:
>
> > Hey Jose,
> >
> > Thanks for the KIP. This is a good improvement and will make the KRaft
> > implementation much more robust in the face of failures and generally
> make
> > it more flexible for users.
> >
> > I did a first pass through the KIP and here are some comments (some of
> > these might just be a little uninformed, so feel free to direct me to
> > supplemental reading):
> > Overall protocol safety wise, the reconfiguration operations look safe.
> >
> > > This UUID will be generated once and persisted as part of the quorum
> > state for the topic partition
> > Do we mean that it will be generated every time the disk on the replica
> is
> > primed (so every disk incarnation will have UUID). I think you describe
> it
> > in a section further below. Best to pull it up to here — “the replica
> uuid
> > is automatically generated once by the replica when persisting the quorum
> > state for the first time.”
> >
> > > If there are any pending voter change operations the leader will wait
> > for them to finish.
> > Will new requests be rejected or queued up behind the pending operation.
> > (I am assuming rejected, but want to confirm)
> >
> > > When this option is used the leader of the KRaft topic partition will
> > not allow the AddVoter RPC to add replica IDs that are not described in
> the
> > configuration and it would not allow the RemoveVoter RPC to remove
> replica
> > IDs that are described in the configuration.
> > Bootstrapping is a little tricky I think. Is it safer/simpler to say that
> > any add/remove RPC operations are blocked until all nodes in the
> config
> > are processed? The way it is worded above makes it seem like the leader
> > will accept adds of the same node from outside. Is that the case?
> >
> > > The KRaft leader will not perform writes from the state machine (active
> > controller) until it has written to the log an AddVoterRecord for every
> > replica id in the controller.quorum.voters  configuration.
> > Just thinking through - One of the safety requirements for the protocol
> is
> > for a leader to commit at least one write in an epoch before doing config
> > changes, right? In this special case we should be ok because the quorum
> has
> > no one but the leader in the beginning. Is that the thought process?
> >
> > > controller.quorum.bootstrap.servers vs controller.quorum.voters
> > I understand the use of quorum.voters, but the bootstrap.servers is not
> > entirely clear to me. So in the example of starting the cluster with one
> > voter, will that one voter be listed here? And when using this
> > configuration, is the expectation that quorum.voters is empty, or will it
> > eventually get populated with the new quorum members? e.g. further in the
> > kip we say — “Replica 3 will discover the partition leader using
> > controller.quorum.voters”; so I guess it will be populated?
> >
> > > This check will be removed and replicas will reply to votes request
> when
> > the candidate is not in the voter set or the voting replica is not in the
> > voter set.
> > This is a major change IMO and I think it would be good if we could
> > somehow highlight it in the KIP to aid a future reader.
> >
> > > This also means that the KRaft implementation needs to handle this
> > uncommitted state getting truncated and reverted.
> > Do we need to talk about the specific behavior a little more here? I mean
> > how does this affect any inflight messages with quorums moving between
> > different values. (Just a brief except to why it works)
> >
> > > This state can be discovered by a client by using the DescribeQuorum
> > RPC, the Admin client or the kafka-quorum.sh CLI.
> > The describeQuorum RPC does not respond with a list of observers today.
> We
> > would need a section to fix that.
> >
> > > The client can now decide to add replica (3, UUID3') to the set of
> > voters using the AddVoter RPC, the Admin client or the kafka-quorum.sh
> CLI.
> > Trying to understand the general thought process: the addition of this
> > node back into the quorum will be a manually triggered operation and not
> > something the node will attempt by itself?
> >
> > This is a general wonderment, and feel free to ignore it:
> > Might be good to list some scenarios demonstrating the safety, e.g. how
> > do we ensure that there is no loss of availability during an addVoter
> > operation when the leader goes down. Or how a failed operation is safely
> > removed from the log and reverted.
> >
> > > On Jul 21, 2022, at 9:49 AM, José Armando García Sancio
> > <jsan...@confluent.io.invalid> wrote:
> > >
> > > Hi all,
> > >
> > > I would like to start the discussion on my design to support
> > > dynamically changing the set of voters in the KRaft cluster metadata
> > > topic partition.
> > >
> > > KIP URL:
> >
> https://cwiki.apache.org/confluence/x/nyH1D
> > >
> > > Thanks!
> > > -José
> >
> >
>
