Hi David, Thanks for the response. But I don't think the LEO-based leader election only benefit for this case. Like in unclean clear election case, we now randomly chose a out-of-sync replica to become the leader. This LEO-based leader election will help this case, too. Besides, not all producers use `acks=all`, thus, when using `acks=0` or `acks=1`, they can also benefit from LEO-based leader election.
That's why I think this could be in a separate KIP, and after that's introduced, this KIP will be a further improvement based on that. Does that make sense? Actually, I don't strongly insist in doing this, if you still think they should be proposed together, I can update the KIP, too. Thank you. Luke On Fri, May 12, 2023 at 4:48 PM David Jacot <dja...@confluent.io.invalid> wrote: > Hi Luke, > > I disagree with this because we don't need the leader election change on > its own if we don't do this KIP. They have to go together or not at all in > my opinion. We need a KIP which designs the entire solution. > > Best, > David > > On Fri, May 12, 2023 at 10:33 AM Luke Chen <show...@gmail.com> wrote: > > > Hi Alexandre, > > > > Thanks for the thoughts. > > I've thought about it, and think I would choose to have a new leader > > election method to fix the problem we encountered, not this "backup-only" > > replica solution. > > But this is still an interesting idea. Like what you've said, this > solution > > can bring many benefits. > > So maybe you can create a proposal for it? > > > > Thank you. > > Luke > > > > On Fri, May 12, 2023 at 4:21 PM Luke Chen <show...@gmail.com> wrote: > > > > > Hi Haruki, > > > > > > Yes, this scenario could happen. > > > I'm thinking we can fix it in step 6, when controller tried to get LEO > > > from B,C replicas, the B,C replica should stop fetcher for this > partition > > > immediately, before returning the LEO. > > > About if we need quorum-based or not, We can discuss in another KIP. > I'm > > > still thinking about it. > > > > > > Thank you. > > > Luke > > > > > > > > > On Fri, May 12, 2023 at 3:59 PM Luke Chen <show...@gmail.com> wrote: > > > > > >> Hi David, > > >> > > >> > It can't be in another KIP as it is required for your proposal to > > work. > > >> This is also an important part to discuss as it requires the > controller > > to > > >> do more operations on leader changes. > > >> > > >> Yes, I know this is a requirement for this KIP to work, and need a lot > > of > > >> discussion. > > >> So that's why I think it'd be better to have a separate KIP to write > the > > >> content and discussion. > > >> I've put the status of this KIP as "pending" and added a note on the > top > > >> of this KIP: > > >> > > >> Note: This KIP requires leader election change, which will be proposed > > in > > >> another KIP. > > >> > > >> Thanks. > > >> Luke > > >> > > >> On Thu, May 11, 2023 at 11:43 PM Alexandre Dupriez < > > >> alexandre.dupr...@gmail.com> wrote: > > >> > > >>> Hi, Luke, > > >>> > > >>> Thanks for your reply. > > >>> > > >>> 102. Whether such a replica could become a leader depends on what the > > >>> end-user wants to use it for and what tradeoffs they wish to make > down > > >>> the line. > > >>> > > >>> There are cases, for instance with heterogeneous or interregional > > >>> networks, where the difference in latency between subsets of brokers > > >>> can be high enough for the "slow replicas" to have a detrimental > > >>> impact on the ISR traffic they take part in. This can justify > > >>> permanently segregating them from ISR traffic by design. And, an > > >>> end-user could still prefer to have these "slow replicas" versus > > >>> alternative approaches such as mirroring for the benefits they can > > >>> bring, for instance: a) they belong to the same cluster with no added > > >>> admin and ops, b) benefit from a direct, simpler replication path, c) > > >>> require less infrastructure than a mirrored solution, d) could become > > >>> unclean leaders for failovers under disaster scenarios such as a > > >>> regional service outages. > > >>> > > >>> Thanks, > > >>> Alexandre > > >>> > > >>> Le jeu. 11 mai 2023 à 14:57, Haruki Okada <ocadar...@gmail.com> a > > écrit > > >>> : > > >>> > > > >>> > Hi, Luke. > > >>> > > > >>> > Though this proposal definitely looks interesting, as others > pointed > > >>> out, > > >>> > the leader election implementation would be the hard part. > > >>> > > > >>> > And I think even LEO-based-election is not safe, which could cause > > >>> silent > > >>> > committed-data loss easily. > > >>> > > > >>> > Let's say we have replicas A,B,C and A is the leader initially, and > > >>> > min.insync.replicas = 2. > > >>> > > > >>> > - 1. Initial > > >>> > * A(leo=0), B(leo=0), C(leo=0) > > >>> > - 2. Produce a message to A > > >>> > * A(leo=1), B(leo=0), C(leo=0) > > >>> > - 3. Another producer produces a message to A (i.e. as the > different > > >>> batch) > > >>> > * A(leo=2), B(leo=0), C(leo=0) > > >>> > - 4. C replicates the first batch. offset=1 is committed (by > > >>> > acks=min.insync.replicas) > > >>> > * A(leo=2), B(leo=0), C(leo=1) > > >>> > - 5. A loses ZK session (or broker session timeout in KRaft) > > >>> > - 6. Controller (regardless ZK/KRaft) doesn't store LEO in itself, > so > > >>> it > > >>> > needs to interact with each replica. It detects C has the largest > LEO > > >>> and > > >>> > decided to elect C as the new leader > > >>> > - 7. Before leader-election is performed, B replicates offset=1,2 > > from > > >>> A. > > >>> > offset=2 is committed > > >>> > * This is possible because even if A lost ZK session, A could > > >>> handle > > >>> > fetch requests for a while. > > >>> > - 8. Controller elects C as the new leader. B truncates its offset. > > >>> > offset=2 is lost silently. > > >>> > > > >>> > I have a feeling that we need quorum-based data replication? as > Divij > > >>> > pointed out. > > >>> > > > >>> > > > >>> > 2023年5月11日(木) 22:33 David Jacot <dja...@confluent.io.invalid>: > > >>> > > > >>> > > Hi Luke, > > >>> > > > > >>> > > > Yes, on second thought, I think the new leader election is > > >>> required to > > >>> > > work > > >>> > > for this new acks option. I'll think about it and open another > KIP > > >>> for it. > > >>> > > > > >>> > > It can't be in another KIP as it is required for your proposal to > > >>> work. > > >>> > > This is also an important part to discuss as it requires the > > >>> controller to > > >>> > > do more operations on leader changes. > > >>> > > > > >>> > > Cheers, > > >>> > > David > > >>> > > > > >>> > > On Thu, May 11, 2023 at 2:44 PM Luke Chen <show...@gmail.com> > > wrote: > > >>> > > > > >>> > > > Hi Ismael, > > >>> > > > Yes, on second thought, I think the new leader election is > > >>> required to > > >>> > > work > > >>> > > > for this new acks option. I'll think about it and open another > > KIP > > >>> for > > >>> > > it. > > >>> > > > > > >>> > > > Hi Divij, > > >>> > > > Yes, I agree with all of them. I'll think about it and let you > > >>> know how > > >>> > > we > > >>> > > > can work together. > > >>> > > > > > >>> > > > Hi Alexandre, > > >>> > > > > 100. The KIP makes one statement which may be considered > > >>> critical: > > >>> > > > "Note that in acks=min.insync.replicas case, the slow follower > > >>> might > > >>> > > > be easier to become out of sync than acks=all.". Would you have > > >>> some > > >>> > > > data on that behaviour when using the new ack semantic? It > would > > be > > >>> > > > interesting to analyse and especially look at the percentage of > > >>> time > > >>> > > > the number of replicas in ISR is reduced to the configured > > >>> > > > min.insync.replicas. > > >>> > > > > > >>> > > > The comparison data would be interesting. I can have a test > when > > >>> > > available. > > >>> > > > But this KIP will be deprioritized because there should be a > > >>> > > pre-requisite > > >>> > > > KIP for it. > > >>> > > > > > >>> > > > > A (perhaps naive) hypothesis would be that the > > >>> > > > new ack semantic indeed provides better produce latency, but at > > the > > >>> > > > cost of precipitating the slowest replica(s) out of the ISR? > > >>> > > > > > >>> > > > Yes, it could be. > > >>> > > > > > >>> > > > > 101. I understand the impact on produce latency, but I am not > > >>> sure > > >>> > > > about the impact on durability. Is your durability model built > > >>> against > > >>> > > > the replication factor or the number of min insync replicas? > > >>> > > > > > >>> > > > Yes, and also the new LEO-based leader election (not proposed > > yet). > > >>> > > > > > >>> > > > > 102. Could a new type of replica which would not be allowed > to > > >>> enter > > >>> > > > the ISR be an alternative? Such replica could attempt > replication > > >>> on a > > >>> > > > best-effort basis and would provide the permanent guarantee not > > to > > >>> > > > interfere with foreground traffic. > > >>> > > > > > >>> > > > You mean a backup replica, which will never become leader > > >>> (in-sync), > > >>> > > right? > > >>> > > > That's an interesting thought and might be able to become a > > >>> workaround > > >>> > > with > > >>> > > > the existing leader election. Let me think about it. > > >>> > > > > > >>> > > > Hi qiangLiu, > > >>> > > > > > >>> > > > > It's a good point that add this config and get better P99 > > >>> latency, but > > >>> > > is > > >>> > > > this changing the meaning of "in sync replicas"? consider a > > >>> situation > > >>> > > with > > >>> > > > "replica=3 acks=2", when two broker fail and left only the > broker > > >>> that > > >>> > > > does't have the message, it is in sync, so will be elected as > > >>> leader, > > >>> > > will > > >>> > > > it cause a NOT NOTICED lost of acked messages? > > >>> > > > > > >>> > > > Yes, it will, so the `min.insync.replicas` config in the > > >>> broker/topic > > >>> > > level > > >>> > > > should be set correctly. In your example, it should be set to > 2, > > >>> so that > > >>> > > > when 2 replicas down, no new message write will be successful. > > >>> > > > > > >>> > > > > > >>> > > > Thank you. > > >>> > > > Luke > > >>> > > > > > >>> > > > > > >>> > > > On Thu, May 11, 2023 at 12:16 PM 67 <6...@gd67.com> wrote: > > >>> > > > > > >>> > > > > Hi Luke, > > >>> > > > > > > >>> > > > > > > >>> > > > > It's a good point that add this config and get better P99 > > >>> latency, but > > >>> > > is > > >>> > > > > this changing the meaning of "in sync replicas"? consider a > > >>> situation > > >>> > > > with > > >>> > > > > "replica=3 acks=2", when two broker fail and left only the > > >>> broker that > > >>> > > > > does't have the message, it is in sync, so will be elected as > > >>> leader, > > >>> > > > will > > >>> > > > > it cause a *NOT NOTICED* lost of acked messages? > > >>> > > > > > > >>> > > > > qiangLiu > > >>> > > > > > > >>> > > > > > > >>> > > > > 在2023年05月10 12时58分,"Ismael Juma"<ism...@juma.me.uk>写道: > > >>> > > > > > > >>> > > > > > > >>> > > > > Hi Luke, > > >>> > > > > > > >>> > > > > As discussed in the other KIP, there are some subtleties when > > it > > >>> comes > > >>> > > to > > >>> > > > > the semantics of the system if we don't wait for all members > of > > >>> the isr > > >>> > > > > before we ack. I don't understand why you say the leader > > election > > >>> > > > question > > >>> > > > > is out of scope - it seems to be a core aspect to me. > > >>> > > > > > > >>> > > > > Ismael > > >>> > > > > > > >>> > > > > > > >>> > > > > On Wed, May 10, 2023, 8:50 AM Luke Chen <show...@gmail.com> > > >>> wrote: > > >>> > > > > > > >>> > > > > > Hi Ismael, > > >>> > > > > > > > >>> > > > > > No, I didn't know about this similar KIP! I hope I've known > > >>> that so > > >>> > > > that I > > >>> > > > > > don't need to spend time to write it again! :( > > >>> > > > > > I checked the KIP and all the discussions (here > > >>> > > > > > < > > >>> https://lists.apache.org/list?dev@kafka.apache.org:gte=100d:KIP-250 > > >>> > > >). > > >>> > > > I > > >>> > > > > > think the consensus is that adding a client config to > > >>> `acks=quorum` > > >>> > > is > > >>> > > > > > fine. > > >>> > > > > > This comment > > >>> > > > > > < > > >>> https://lists.apache.org/thread/p77pym5sxpn91r8j364kmmf3qp5g65rn> > > >>> > > > from > > >>> > > > > > Guozhang pretty much concluded what I'm trying to do. > > >>> > > > > > > > >>> > > > > > > > >>> > > > > > > > >>> > > > > > > > >>> > > > > > > > >>> > > > > > > > >>> > > > > > > > >>> > > > > > > > >>> > > > > > *1. Add one more value to client-side acks config: 0: no > > acks > > >>> > > needed > > >>> > > > at > > >>> > > > > > all. 1: ack from the leader. all: ack from ALL the ISR > > >>> replicas > > >>> > > > > > quorum: this is the new value, it requires ack from enough > > >>> number of > > >>> > > > ISR > > >>> > > > > > replicas no smaller than majority of the replicas AND no > > >>> smaller > > >>> > > > > > than{min.isr}.2. Clarify in the docs that if a user wants > to > > >>> > > tolerate X > > >>> > > > > > failures, she needs to set client acks=all or acks=quorum > > >>> (better > > >>> > > tail > > >>> > > > > > latency than "all") with broker {min.sir} to be X+1; > however, > > >>> "all" > > >>> > > is > > >>> > > > not > > >>> > > > > > necessarily stronger than "quorum".* > > >>> > > > > > > > >>> > > > > > Concerns from KIP-250 are: > > >>> > > > > > 1. Introducing a new leader LEO based election method. This > > is > > >>> not > > >>> > > > clear in > > >>> > > > > > the KIP-250 and needs more discussion > > >>> > > > > > 2. The KIP-250 also tried to optimize the consumer latency > to > > >>> read > > >>> > > > messages > > >>> > > > > > beyond high watermark, which also has some discussion about > > >>> how to > > >>> > > > achieve > > >>> > > > > > that, and no conclusion > > >>> > > > > > > > >>> > > > > > Both of the above 2 concerns are out of the scope of my > > >>> current KIP. > > >>> > > > > > So, I think it's good to provide this `acks=quorum` or > > >>> > > > > > `acks=min.insync.replicas` option to users to give them > > another > > >>> > > choice. > > >>> > > > > > > > >>> > > > > > > > >>> > > > > > Thank you. > > >>> > > > > > Luke > > >>> > > > > > > > >>> > > > > > > > >>> > > > > > On Wed, May 10, 2023 at 8:54 AM Ismael Juma < > > ism...@juma.me.uk > > >>> > > > >>> > > wrote: > > >>> > > > > > > > >>> > > > > > > Hi Luke, > > >>> > > > > > > > > >>> > > > > > > Are you aware of > > >>> > > > > > > > > >>> > > > > > > > > >>> > > > > > > > >>> > > > > > >>> > > > > >>> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-250+Add+Support+for+Quorum-based+Producer+Acknowledgment > > >>> > > > > > > ? > > >>> > > > > > > > > >>> > > > > > > Ismael > > >>> > > > > > > > > >>> > > > > > > On Tue, May 9, 2023 at 10:14 PM Luke Chen < > > show...@gmail.com > > >>> > > > >>> > > wrote: > > >>> > > > > > > > > >>> > > > > > > > Hi all, > > >>> > > > > > > > > > >>> > > > > > > > I'd like to start a discussion for the KIP-926: > > introducing > > >>> > > > > > > > acks=min.insync.replicas config. This KIP is to > introduce > > >>> > > > > > > > `acks=min.insync.replicas` config value in producer, to > > >>> improve > > >>> > > the > > >>> > > > > > write > > >>> > > > > > > > throughput and still guarantee high durability. > > >>> > > > > > > > > > >>> > > > > > > > Please check the link for more detail: > > >>> > > > > > > > > > >>> > > > > > > > > > >>> > > > > > > > > >>> > > > > > > > >>> > > > > > >>> > > > > >>> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-926%3A+introducing+acks%3Dmin.insync.replicas+config > > >>> > > > > > > > > > >>> > > > > > > > Any feedback is welcome. > > >>> > > > > > > > > > >>> > > > > > > > Thank you. > > >>> > > > > > > > Luke > > >>> > > > > > > > > > >>> > > > > > > > > >>> > > > > > > > >>> > > > > > > >>> > > > > > > >>> > > > > > > >>> > > > > > > >>> > > > > > >>> > > > > >>> > > > >>> > > > >>> > -- > > >>> > ======================== > > >>> > Okada Haruki > > >>> > ocadar...@gmail.com > > >>> > ======================== > > >>> > > >> > > >