Jack Vanlightly wrote:

> - Regarding the removal of voters, when a leader appends a
> RemoveVoterRecord to its log, it immediately switches to the new
> configuration. There are two cases here:
>
> 1. The voter being removed is the leader itself. The KIP documents that
> the followers will continue to fetch from the leader despite these
> followers learning that the leader has been removed from the voter set.
> This is correct. It might be worth stating the reason: otherwise the
> cluster can get blocked from making further progress, because the
> followers must keep fetching from the leader or else the leader cannot
> commit the record and resign.

Yes. The KIP now reads: "To allow this operation to be committed and for the leader to resign, the followers will continue to fetch from the leader even if the leader is not part of the new voter set. In KRaft, leader election is triggered when the voter hasn't received a successful response within the fetch timeout."
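As an illustration of that behavior, here is a minimal sketch of the removed leader's bookkeeping, assuming Java 17 and hypothetical names (RemovedLeaderSketch, onHighWatermarkUpdate, resign) that are not the actual KRaft classes: the leader keeps serving Fetch requests after appending the record that removes it, and only resigns once the high watermark passes that record.

    import java.util.HashSet;
    import java.util.Set;

    // Sketch only: names are illustrative, not the actual KRaft implementation.
    final class RemovedLeaderSketch {
        private final int localId;
        private final Set<Integer> voters;      // latest (possibly uncommitted) voter set
        private long removeRecordOffset = -1L;  // offset of the record that removes this leader

        RemovedLeaderSketch(int localId, Set<Integer> initialVoters) {
            this.localId = localId;
            this.voters = new HashSet<>(initialVoters);
        }

        // The leader switches to the new configuration as soon as it appends the record.
        void onRemoveVoterRecordAppended(int removedVoterId, long offset) {
            voters.remove(removedVoterId);
            if (removedVoterId == localId) {
                removeRecordOffset = offset;
            }
        }

        // Followers keep fetching from this leader even though it is no longer a voter;
        // once the record is committed (covered by the high watermark) the leader resigns.
        void onHighWatermarkUpdate(long highWatermark) {
            if (removeRecordOffset >= 0 && highWatermark > removeRecordOffset) {
                resign();
            }
        }

        private void resign() {
            // e.g. send EndQuorumEpoch to the remaining voters and continue as an observer
        }
    }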
> 2. The voter being removed is not the leader. We should document that
> the leader will accept fetches with replica IDs (i.e. not observer
> fetches) from followers who are not in the voter set of the leader. This
> will occur because the voter being removed won't have received the
> RemoveVoterRecord yet and it must be allowed to reach this record so that:

In KRaft, an observer is any replica, with or without an (ID, UUID), that is not part of the voter set. I added a Key Terms section with definitions of the important terms used in the KRaft implementation and KIPs. I'll add more terms to that section as the discussion continues.

> - Regarding what happens when a voter leaves the voter set. When a
> non-leader voter reaches a RemoveVoterRecord where it is the subject of
> the removal, does it switch to being an observer and continue fetching?
> When a leader is removed, it carries on until it has committed the record
> or an election occurs electing a different leader. Until that point, it
> is leader but not a voter, so that makes it an observer? After it has
> committed the record and resigns (or gets deposed by an election) does it
> then start fetching as an observer?

Yeah. Good points. I was missing Handling sections for AddVoterRecord and RemoveVoterRecord. I added those sections and they go into this detail. I should point out that observer is not technically a state in KRaft. The better way to think about it is that the voter set determines which states, specifically the candidate state, a follower is allowed to transition to.

> - I think the Add/RemoveVoterRecords should also include the current
> voter set. This will make it easier to code against and also make
> troubleshooting easier. Else voter membership can only be reliably
> calculated by replaying the whole log.

Yeah. The reality of the implementation is that the replicas will have to read the entire snapshot and log before they can determine the voter set. I am concerned that adding this field would unnecessarily complicate the snapshotting logic, since it would have to remember which AddVoterRecords were already appended to the snapshot. I think we can make the topic partition snapshot and log more debuggable by improving the kafka-metadata-shell. It is not part of this KIP, but I hope to write a future KIP that describes how the kafka-metadata-shell displays information about the cluster metadata partition.
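To illustrate what "replaying the whole log" amounts to, here is a minimal sketch of deriving the voter set from the control records, assuming Java 17 and simplified stand-in record types (AddVoter, RemoveVoter) rather than the exact record schemas in the KIP:

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Sketch only: AddVoter/RemoveVoter are simplified stand-ins for the
    // AddVoterRecord and RemoveVoterRecord control record schemas.
    record AddVoter(int voterId) {}
    record RemoveVoter(int voterId) {}

    final class VoterSetReplay {
        // Derive the current voter set by replaying the latest snapshot followed
        // by the remainder of the log, applying each add/remove record in order.
        static Set<Integer> voterSet(List<Object> snapshotThenLogRecords) {
            Set<Integer> voters = new HashSet<>();
            for (Object record : snapshotThenLogRecords) {
                if (record instanceof AddVoter add) {
                    voters.add(add.voterId());
                } else if (record instanceof RemoveVoter remove) {
                    voters.remove(remove.voterId());
                }
            }
            return voters;
        }
    }

Nothing here is specific to the KIP's wire format; it is only meant to show why a replica must see the full snapshot plus the log suffix before it knows the current voter set.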
> - Regarding the adding of voters:
>
> 1. We should document the problem of adding a new voter which then
> causes all progress to be blocked until the voter catches up with the
> leader. For example, in a 3 node cluster, we lose 1 node. We add a new
> node, which means we have a majority = 3 with only 3 functioning nodes.
> Until the new node has caught up, the high watermark cannot advance. This
> can be solved by ensuring that to add a node we start it first as an
> observer and, once it has caught up, send the AddVoter RPC to the leader.
> This leads to the question of how an observer determines that it has
> caught up.

Yes. I have the following in the AddVoter Handling section:

--- Start of Section RPCs/AddVoter/Handling from KIP ---

When the leader receives an AddVoter request it will do the following:

1. Wait for the fetch offset of the replica (ID, UUID) to catch up to the log end offset of the leader.
2. Wait until there are no uncommitted add or remove voter records.
3. Append the AddVoterRecord to the log.
4. The KRaft internal listener will read this record from the log and add the voter to the voter set.
5. Wait for the AddVoterRecord to commit using the majority of the new configuration.
6. Send the AddVoter response to the client.

In 1., the leader needs to wait for the replica to catch up because when the AddVoterRecord is appended to the log the set of voters changes. If the added voter is far behind, it can take some time for it to reach the HWM. During this time the leader cannot commit data and the quorum will be unavailable from the perspective of the state machine. We mitigate this by waiting for the new replica to catch up before adding it to the set of voters.

In 4., the new replica will be part of the quorum, so the leader will start sending BeginQuorumEpoch requests to this replica. It is possible that the new replica has not yet replicated and applied this AddVoterRecord, so it doesn't know that it is a voter for this topic partition. The new replica will fail this RPC until it discovers that it is in the voter set. The leader will continue to retry until the RPC succeeds.

--- End of Section RPCs/AddVoter/Handling from KIP ---

The leader will track the log end offset of the to-be-added replica using the offset and epoch sent in the Fetch and FetchSnapshot requests. This is possible because the to-be-added replicas will send their replica ID and UUID in those requests. The leader will not write the AddVoterRecord to the log until it has determined that the to-be-added replica has caught up. I think that exactly when a replica counts as caught up is an implementation detail, and we can have that detailed discussion in Jira or the PR. What do you think?

> 2. Perhaps an administrator should send a Join RPC to the node we want
> to add. It will, if it isn't already, start fetching as an observer and
> automatically do the AddVoter RPC to the leader itself. This way, from a
> human operations point of view, there is only a single action to perform.

I am proposing that to-be-added replicas will immediately start replicating the log from the cluster metadata leader. This is how the leader discovers them and is able to return them in the DescribeQuorum RPC. At this point these replicas are observers, not voters. At any point the operator can use the AddVoter RPC, the Java Admin client, or "kafka-metadata-quorum --add-voter" to add voters to the cluster metadata topic partition. This RPC will get sent to the current leader, if there is one. The leader handles this request as described in the section I pasted above.

I like the idea of sending it directly to the leader and having the response held until the AddVoterRecord commits. This makes it much easier for an operator to programmatically add and remove voters, as it only needs to wait for a successful response from the cluster to know that the voter has been added or removed. It doesn't need to poll the cluster with DescribeQuorum to know when the voter has been added.
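As a summary of the handling steps quoted above, here is a rough sketch of the leader-side flow. All of the types and method names (AddVoterHandler, QuorumState, MetadataLog, awaitCaughtUp, and so on) are illustrative placeholders, not the actual KRaft implementation; the point is only the ordering of the catch-up, append, and commit steps.

    import java.util.concurrent.CompletableFuture;

    // Rough sketch of the leader-side AddVoter handling described above.
    // All types and methods here are illustrative placeholders.
    final class AddVoterHandler {
        interface ReplicaKey {}  // stands in for the replica (ID, UUID) pair

        interface QuorumState {
            CompletableFuture<Void> awaitCaughtUp(ReplicaKey replica, long leaderEndOffset);
            CompletableFuture<Void> awaitNoUncommittedVoterChanges();
            CompletableFuture<Void> awaitCommitted(long offset);  // majority of the new voter set
        }

        interface MetadataLog {
            long endOffset();
            long appendAddVoterRecord(ReplicaKey replica);  // returns the record's offset
        }

        private final QuorumState quorum;
        private final MetadataLog log;

        AddVoterHandler(QuorumState quorum, MetadataLog log) {
            this.quorum = quorum;
            this.log = log;
        }

        CompletableFuture<Void> handleAddVoter(ReplicaKey newVoter) {
            // 1. Wait for the new replica's fetch offset to reach the leader's log end offset.
            return quorum.awaitCaughtUp(newVoter, log.endOffset())
                // 2. Wait until there are no uncommitted add or remove voter records.
                .thenCompose(ignored -> quorum.awaitNoUncommittedVoterChanges())
                // 3. Append the AddVoterRecord; 4. the internal listener applies it to the voter set.
                .thenApply(ignored -> log.appendAddVoterRecord(newVoter))
                // 5. Wait for the record to commit under the majority of the new configuration.
                .thenCompose(offset -> quorum.awaitCommitted(offset))
                // 6. Only now complete the AddVoter response back to the client.
                .thenRun(() -> { /* send successful AddVoter response */ });
        }
    }

Holding the response until step 6 is what lets an operator treat a successful AddVoter response as confirmation that the new voter set is committed, without polling DescribeQuorum.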
> - For any of the community who understand or are interested in TLA+,
> should we include the provisional TLA+ specification for this work?
> https://github.com/Vanlightly/raft-tlaplus/blob/main/specifications/pull-raft/KRaftWithReconfig.tla

Yes, Jack. Thanks a lot for your work on the TLA+ specification for KRaft and this KIP. It was critical to getting this KIP done. I have also added a link to your TLA+ specification to the Test Plan section.