Hi, Jose,

Thanks for the KIP. A few comments below.
10. kraft.version: Functionality-wise, this seems very similar to metadata.version, which is to make sure that all brokers/controllers are on a supported version before enabling a new feature. Could you explain why we need a new one instead of just relying on metadata.version?

11. Both the quorum-state file and controller.quorum.bootstrap.servers contain endpoints. Which one takes precedence?

12. It seems that downgrading from this KIP is not supported. Could we have a section that makes this explicit?

13. controller.quorum.auto.join.enable: If this is set to true, when does the controller issue the AddVoter RPC? Does it need to wait until it's caught up? Does it issue the AddVoter RPC on every restart?

14. "using the AddVoter RPC, the Admin client or the kafka-metadata-quorum CLI.": In general, the operator doesn't know how to make RPC calls. So the practical options are either the CLI or the Admin client (see the sketch after 18. below).

15. VotersRecord: Why does it need to include Name and SecurityProtocol in EndPoints? It's meant to replace controller.quorum.voters, which only includes host/port.

16. "The KRaft leader cannot do this for observers (brokers) since their supported versions are not directly known by the KRaft leader." Hmm, the KRaft leader receives a BrokerRegistrationRequest that includes the supported feature versions, right?

17. UpdateVoter:
17.1 "The leader will learn the range of supported version from the UpdateVoter RPC". KIP-919 introduced ControllerRegistrationRequest to do that. Do we need a new one?
17.2 Do we support changing the voter's endpoints dynamically? If not, it seems that this can be part of ControllerRegistrationRequest too.

18. AddVoter:
18.1 "This RPC can be sent to a broker or controller, when sent to a broker, the broker will forward the request to the active controller." If it's sent to a non-active controller, it will also be forwarded to the active controller, right?
18.2 Why do we need the name/security protocol fields in the request? Currently, they can be derived from the configs.

    { "name": "Listeners", "type": "[]Listener", "versions": "0+",
      "about": "The endpoints that can be used to communicate with the voter", "fields": [
        { "name": "Name", "type": "string", "versions": "0+", "mapKey": true,
          "about": "The name of the endpoint" },
        { "name": "Host", "type": "string", "versions": "0+",
          "about": "The hostname" },
        { "name": "Port", "type": "uint16", "versions": "0+",
          "about": "The port" },
        { "name": "SecurityProtocol", "type": "int16", "versions": "0+",
          "about": "The security protocol" }
    ]}

18.3 "4. Send an ApiVersions RPC to the first listener to discover the supported kraft.version of the new voter." Hmm, I thought that we found using ApiVersions unreliable (https://issues.apache.org/jira/browse/KAFKA-15230) and therefore introduced ControllerRegistrationRequest to propagate this information. ControllerRegistrationRequest can be made at step 1 during catch-up.
18.4 "In 4., the new replica will be part of the quorum so the leader will start sending BeginQuorumEpoch requests to this replica." Hmm, the leader should have sent BeginQuorumEpoch at step 1 so that the new replica can catch up from it, right? Also, step 4 above only mentions the ApiVersions RPC, not BeginQuorumEpoch.
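To make 14. concrete (and to check my understanding of the flow in 18.), this is roughly the shape I would expect the Admin client path to take. It is only a sketch; the interface, method, and class names below are my own placeholders, not from the KIP:

    import java.util.Set;
    import java.util.UUID;

    public class AddVoterSketch {
        // Hypothetical admin-side interface for adding a voter; the real API
        // name and signature are up to the KIP, not this sketch.
        interface QuorumAdmin {
            void addRaftVoter(int voterId, UUID voterDirectoryId, Set<Endpoint> endpoints);
        }

        // Host/port plus listener name, mirroring the Listeners schema in 18.2.
        record Endpoint(String listenerName, String host, int port) {}

        static void addNewController(QuorumAdmin admin) {
            // The new controller is first started as an observer and catches up;
            // only then does the operator (or the auto-join logic in 13.) ask the
            // active controller to add it to the voter set.
            admin.addRaftVoter(
                4,                          // replica id of the new controller
                UUID.randomUUID(),          // its directory id (replica uuid)
                Set.of(new Endpoint("CONTROLLER", "controller-4.example.com", 9093)));
        }
    }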
19. Vote: It's kind of weird that VoterUuid is at the partition level. VoterId and VoterUuid uniquely identify a node, right? So it seems that it should be at the same level as VoterId?

20. EndQuorumEpoch: "PreferredSuccessor which is an array is replica ids, will be replaced by PreferredCandidates"
20.1 But PreferredSuccessor is still listed in the response.
20.2 Why does the response need to include endPoint?

21. "The KRaft implementation and protocol describe in KIP-595 and KIP-630 never read from the log or snapshot." This seems confusing. The KIP-595 protocol does read the log to replicate it, right?

22. "create a snapshot at 00000000000000000000-0000000000.checkpoint with the necessary control records (KRaftVersionRecord and VotersRecord) to make this Kafka node the only voter for the quorum."
22.1 It seems that currently, the bootstrap checkpoint is named bootstrap.checkpoint. Are we changing it to 00000000000000000000-0000000000.checkpoint?
22.2 Just to be clear, records in the bootstrap snapshot will be propagated to QuorumStateData, right?

23. kafka-metadata-quorum: In the output, we include the endpoints for the voters, but not the observers. Why the inconsistency?

24. "The tool will set the TimoutMs for the AddVoterRequest to 30 seconds." Earlier, we have "The TimoutMs for the AddVoter RPC will be set to 10 minutes". Should we make them consistent?

25. "The replicas will update these order list of voters set whenever the latest snapshot id increases, a VotersRecord control record is read from the log and the log is truncated." Why will the voters set change when a new snapshot is created? (See the sketch after 28. below.)

26. VotersRecord includes MinSupportedVersion and MaxSupportedVersion. In FeatureLevelRecord, we only have one finalized version. Should we be consistent?

27. The Raft dissertation mentioned the issue of disruptive servers after they are removed: "Once the cluster leader has created the Cnew entry, a server that is not in Cnew will no longer receive heartbeats, so it will time out and start new elections." Have we addressed this issue?

28. There are quite a few typos in the KIP.
28.1 "the voters are the replica ID and UUID is in its own voters set": does not read well.
28.2 "used to configured": used to configure
28.3 "When at voter becomes a leader": when a voter
28.4 "write an VotersRecord controler": a VotersRecord; controller
28.5 "will get bootstrap": bootstrapped
28.6 "the leader of the KRaft cluster metadata leader": the leader of the KRaft cluster metadata partition
28.7 "until the call as been acknowledge": has been acknowledged
28.8 "As describe in KIP-595": described
28.9 "The KRaft implementation and protocol describe": described
28.10 "In other, the KRaft topic partition won't": in other words
28.11 "greater than their epic": epoch
28.12 "so their state will be tracking using their ID and UUID": tracked
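Going back to 25., this is the bookkeeping I picture on the replica side, which is why the snapshot trigger in that sentence surprised me. Again, just a rough sketch; the class and method names are mine, not the KIP's:

    import java.util.Set;
    import java.util.TreeMap;

    public class VoterSetHistorySketch {
        // Offset of the VotersRecord (or snapshot) -> voter ids in effect at that offset.
        private final TreeMap<Long, Set<Integer>> history = new TreeMap<>();

        // Called when a VotersRecord is read from the log or from a loaded snapshot.
        public void addAt(long offset, Set<Integer> voters) {
            history.put(offset, voters);
        }

        // Called when the log is truncated to endOffset: drop the voter sets that
        // came from records at or after the truncation point.
        public void truncateTo(long endOffset) {
            history.tailMap(endOffset, true).clear();
        }

        // The voter set the replica should currently use.
        public Set<Integer> lastVoterSet() {
            return history.isEmpty() ? Set.of() : history.lastEntry().getValue();
        }
    }

If something like this is the intended model, it would help if the KIP spelled out which of the three triggers can actually change the current voter set.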
Thanks,

Jun

On Thu, Feb 1, 2024 at 7:27 AM Jack Vanlightly <vanligh...@apache.org> wrote:

> Hi Jose,
>
> I have a question about how voters and observers, which are far behind the
> leader, catch up when there are multiple reconfiguration commands in the
> log between their position and the end of the log.
>
> Here are some example situations that need clarification:
>
> Example 1
> Imagine a cluster of three voters: r1, r2, r3. Voter r3 goes offline for a
> while. In the meantime, r1 dies and gets replaced with r4, and r2 dies
> getting replaced with r5. Now the cluster is formed of r3, r4, r5. When r3
> comes back online, it tries to fetch from dead nodes and finally starts
> unending leader elections - stuck because it doesn't realise it's in a
> stale configuration whose members are all dead except for itself.
>
> Example 2
> Imagine a cluster of three voters: r1, r2, r3. Voter r3 goes offline then
> comes back and discovers the leader is r1. Again, there are many
> reconfiguration commands between its LEO and the end of the leader's log.
> It starts fetching, changing configurations as it goes until it reaches a
> stale configuration (r3, r4, r5) where it is a member but none of its peers
> are actually alive anymore. It continues to fetch from r1, but then for
> some reason the connection to r1 is interrupted. r3 starts leader elections
> which don't get responses.
>
> Example 3
> Imagine a cluster of three voters: r1, r2, r3. Over time, many
> reconfigurations have happened and now the voters are (r4, r5, r6). The
> observer o1 starts fetching from the nodes in
> 'controller.quorum.bootstrap.servers' which includes r4. r4 responds with a
> NotLeader and that r5 is the leader. o1 starts fetching and goes through
> the motions of switching to each configuration as it learns of it in the
> log. The connection to r5 gets interrupted while it is in the configuration
> (r7, r8, r9). It attempts to fetch from these voters but none respond as
> they are all long dead, as this is a stale configuration. Does the observer
> fall back to 'controller.quorum.bootstrap.servers' for its list of voters
> it can fetch from?
>
> After thinking it through, it occurs to me that in examples 1 and 2, the
> leader (of the latest configuration) should be sending BeginQuorumEpoch
> requests to r3 after a certain timeout? r3 can start elections (based on
> its stale configuration) which will go nowhere, until it eventually
> receives a BeginQuorumEpoch from the leader and it will learn of the leader
> and resume fetching.
>
> In the case of an observer, I suppose it must fall back to
> 'controller.quorum.voters' or 'controller.quorum.bootstrap.servers' to
> learn of the leader?
>
> Thanks
> Jack
>
>
> On Fri, Jan 26, 2024 at 1:36 AM José Armando García Sancio
> <jsan...@confluent.io.invalid> wrote:
>
> > Hi all,
> >
> > I have updated the KIP to include information on how KRaft controller
> > automatic joining will work.
> >
> > Thanks,
> > --
> > -José
> >
>