Hi Jack,

See my comments below.

On Thu, Feb 1, 2024 at 7:26 AM Jack Vanlightly <vanligh...@apache.org> wrote:
> After thinking it through, it occurs to me that in examples 1 and 2, the
> leader (of the latest configuration) should be sending BeginQuorumEpoch
> requests to r3 after a certain timeout? r3 can start elections (based on
> its stale configuration) which will go nowhere, until it eventually
> receives a BeginQuorumEpoch from the leader and it will learn of the leader
> and resume fetching.
>
> In the case of an observer, I suppose it must fallback to
> 'controller.quorum.voters' or 'controller.quorum.bootstrap.servers' to
> learn of the leader?
Great examples. The short answer is yes: the leader needs to keep sending BeginQuorumEpoch to a voter until the voter acknowledges it. The current KRaft implementation already does this. The set of unacknowledged voters is tracked in LeaderState::nonAcknowledgeVoters and is used to send the BeginQuorumEpoch RPC. The leader keeps sending the BeginQuorumEpoch RPC until each voter has acknowledged it.

When a voter handles BeginQuorumEpoch, it first persists the leader id, the leader uuid (new in this KIP) and the leader epoch to the quorum-state file, and only then replies to the RPC. Because the voter persists the leader's state before replying, a single RPC and acknowledgement is enough. Note that if a replica loses its disk and its quorum state, it will come back with a different replica uuid and won't be part of the quorum.

The other important observation is that, in this KIP, KRaft will have two sets of endpoints:

1. The voters set. It can come from the log, as you described, or from controller.quorum.voters if the log doesn't contain any voters sets.
2. The bootstrap servers. They can come from controller.quorum.bootstrap.servers or controller.quorum.voters, in that order of preference.

The Fetch RPC is always sent to the leader's endpoint if the leader is known. If the leader is not known, the Fetch RPC is sent to one of the endpoints in the bootstrap servers set. Vote, BeginQuorumEpoch and EndQuorumEpoch are always sent to the replicas and endpoints specified in the voters set.

Your example also highlights the importance of "KIP-996: Pre-Vote" for avoiding disruption to the quorum while lagging replicas catch up.

Thanks for your feedback, Jack. I'll update the KIP to make this clear.

Thanks,
-- 
-José
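For illustration, the endpoint-selection and BeginQuorumEpoch retry rules from this reply can be sketched in a few lines of Java. Everything in the sketch is hypothetical: QuorumEndpointsSketch, Rpc, destinations and BeginQuorumEpochTracker are illustrative names, not KafkaRaftClient or LeaderState APIs. Endpoints are simplified to "host:port" strings, voters to integer ids, and replica uuids are ignored.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch of the endpoint rules described above; not Kafka's API.
final class QuorumEndpointsSketch {
    enum Rpc { FETCH, VOTE, BEGIN_QUORUM_EPOCH, END_QUORUM_EPOCH }

    // Voters set: the latest voters set found in the log, falling back to
    // the static controller.quorum.voters configuration.
    static List<String> votersSet(Optional<List<String>> votersFromLog,
                                  List<String> quorumVoters) {
        return votersFromLog.orElse(quorumVoters);
    }

    // Bootstrap servers: controller.quorum.bootstrap.servers if configured,
    // otherwise controller.quorum.voters.
    static List<String> bootstrapServers(List<String> quorumBootstrapServers,
                                         List<String> quorumVoters) {
        return quorumBootstrapServers.isEmpty() ? quorumVoters : quorumBootstrapServers;
    }

    // Candidate destinations for an RPC. For Fetch with an unknown leader,
    // the caller picks one endpoint from the returned bootstrap list.
    static List<String> destinations(Rpc rpc, Optional<String> leaderEndpoint,
                                     List<String> voters, List<String> bootstrap) {
        if (rpc == Rpc.FETCH) {
            return leaderEndpoint.map(endpoint -> List.of(endpoint)).orElse(bootstrap);
        }
        // Vote, BeginQuorumEpoch and EndQuorumEpoch only go to the voters set.
        return voters;
    }

    // Leader-side bookkeeping: voters that have not yet acknowledged
    // BeginQuorumEpoch for the current epoch.
    static final class BeginQuorumEpochTracker {
        private final Set<Integer> unacknowledged;

        BeginQuorumEpochTracker(Set<Integer> voterIds, int leaderId) {
            unacknowledged = new HashSet<>(voterIds);
            unacknowledged.remove(leaderId); // the leader does not send to itself
        }

        // On each retry timeout, resend BeginQuorumEpoch to every voter still pending.
        Set<Integer> votersToResend() {
            return Set.copyOf(unacknowledged);
        }

        // A successful response means the voter already persisted the leader state,
        // so one acknowledgement is enough to stop resending.
        void onAcknowledged(int voterId) {
            unacknowledged.remove(voterId);
        }
    }
}
```

In this sketch, keeping the two endpoint sets separate means a replica with a stale view of the voters can still find the leader through Fetch via the bootstrap servers, while Vote and the quorum-epoch RPCs stay limited to the voters set.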