Re: [DISCUSS] KIP-926: introducing acks=min.insync.replicas config

Luke Chen Fri, 12 May 2023 01:57:59 -0700

Hi David,

Thanks for the response.
But I don't think the LEO-based leader election only benefit for this case.
Like in unclean clear election case, we now randomly chose a out-of-sync
replica to become the leader.
This LEO-based leader election will help this case, too.
Besides, not all producers use `acks=all`, thus, when using `acks=0` or
`acks=1`, they can also benefit from LEO-based leader election.


That's why I think this could be in a separate KIP, and after that's
introduced, this KIP will be a further improvement based on that.
Does that make sense?

Actually, I don't strongly insist in doing this, if you still think they
should be proposed together, I can update the KIP, too.

Thank you.
Luke

On Fri, May 12, 2023 at 4:48 PM David Jacot <dja...@confluent.io.invalid>
wrote:

> Hi Luke,
>
> I disagree with this because we don't need the leader election change on
> its own if we don't do this KIP. They have to go together or not at all in
> my opinion. We need a KIP which designs the entire solution.
>
> Best,
> David
>
> On Fri, May 12, 2023 at 10:33 AM Luke Chen <show...@gmail.com> wrote:
>
> > Hi Alexandre,
> >
> > Thanks for the thoughts.
> > I've thought about it, and think I would choose to have a new leader
> > election method to fix the problem we encountered, not this "backup-only"
> > replica solution.
> > But this is still an interesting idea. Like what you've said, this
> solution
> > can bring many benefits.
> > So maybe you can create a proposal for it?
> >
> > Thank you.
> > Luke
> >
> > On Fri, May 12, 2023 at 4:21 PM Luke Chen <show...@gmail.com> wrote:
> >
> > > Hi Haruki,
> > >
> > > Yes, this scenario could happen.
> > > I'm thinking we can fix it in step 6, when controller tried to get LEO
> > > from B,C replicas, the B,C replica should stop fetcher for this
> partition
> > > immediately, before returning the LEO.
> > > About if we need quorum-based or not, We can discuss in another KIP.
> I'm
> > > still thinking about it.
> > >
> > > Thank you.
> > > Luke
> > >
> > >
> > > On Fri, May 12, 2023 at 3:59 PM Luke Chen <show...@gmail.com> wrote:
> > >
> > >> Hi David,
> > >>
> > >> > It can't be in another KIP as it is required for your proposal to
> > work.
> > >> This is also an important part to discuss as it requires the
> controller
> > to
> > >> do more operations on leader changes.
> > >>
> > >> Yes, I know this is a requirement for this KIP to work, and need a lot
> > of
> > >> discussion.
> > >> So that's why I think it'd be better to have a separate KIP to write
> the
> > >> content and discussion.
> > >> I've put the status of this KIP as "pending" and added a note on the
> top
> > >> of this KIP:
> > >>
> > >> Note: This KIP requires leader election change, which will be proposed
> > in
> > >> another KIP.
> > >>
> > >> Thanks.
> > >> Luke
> > >>
> > >> On Thu, May 11, 2023 at 11:43 PM Alexandre Dupriez <
> > >> alexandre.dupr...@gmail.com> wrote:
> > >>
> > >>> Hi, Luke,
> > >>>
> > >>> Thanks for your reply.
> > >>>
> > >>> 102. Whether such a replica could become a leader depends on what the
> > >>> end-user wants to use it for and what tradeoffs they wish to make
> down
> > >>> the line.
> > >>>
> > >>> There are cases, for instance with heterogeneous or interregional
> > >>> networks, where the difference in latency between subsets of brokers
> > >>> can be high enough for the "slow replicas" to have a detrimental
> > >>> impact on the ISR traffic they take part in. This can justify
> > >>> permanently segregating them from ISR traffic by design. And, an
> > >>> end-user could still prefer to have these "slow replicas" versus
> > >>> alternative approaches such as mirroring for the benefits they can
> > >>> bring, for instance: a) they belong to the same cluster with no added
> > >>> admin and ops, b) benefit from a direct, simpler replication path, c)
> > >>> require less infrastructure than a mirrored solution, d) could become
> > >>> unclean leaders for failovers under disaster scenarios such as a
> > >>> regional service outages.
> > >>>
> > >>> Thanks,
> > >>> Alexandre
> > >>>
> > >>> Le jeu. 11 mai 2023 à 14:57, Haruki Okada <ocadar...@gmail.com> a
> > écrit
> > >>> :
> > >>> >
> > >>> > Hi, Luke.
> > >>> >
> > >>> > Though this proposal definitely looks interesting, as others
> pointed
> > >>> out,
> > >>> > the leader election implementation would be the hard part.
> > >>> >
> > >>> > And I think even LEO-based-election is not safe, which could cause
> > >>> silent
> > >>> > committed-data loss easily.
> > >>> >
> > >>> > Let's say we have replicas A,B,C and A is the leader initially, and
> > >>> > min.insync.replicas = 2.
> > >>> >
> > >>> > - 1. Initial
> > >>> >     * A(leo=0), B(leo=0), C(leo=0)
> > >>> > - 2. Produce a message to A
> > >>> >     * A(leo=1), B(leo=0), C(leo=0)
> > >>> > - 3. Another producer produces a message to A (i.e. as the
> different
> > >>> batch)
> > >>> >     * A(leo=2), B(leo=0), C(leo=0)
> > >>> > - 4. C replicates the first batch. offset=1 is committed (by
> > >>> > acks=min.insync.replicas)
> > >>> >     * A(leo=2), B(leo=0), C(leo=1)
> > >>> > - 5. A loses ZK session (or broker session timeout in KRaft)
> > >>> > - 6. Controller (regardless ZK/KRaft) doesn't store LEO in itself,
> so
> > >>> it
> > >>> > needs to interact with each replica. It detects C has the largest
> LEO
> > >>> and
> > >>> > decided to elect C as the new leader
> > >>> > - 7. Before leader-election is performed, B replicates offset=1,2
> > from
> > >>> A.
> > >>> > offset=2 is committed
> > >>> >     * This is possible because even if A lost ZK session, A could
> > >>> handle
> > >>> > fetch requests for a while.
> > >>> > - 8. Controller elects C as the new leader. B truncates its offset.
> > >>> > offset=2 is lost silently.
> > >>> >
> > >>> > I have a feeling that we need quorum-based data replication? as
> Divij
> > >>> > pointed out.
> > >>> >
> > >>> >
> > >>> > 2023年5月11日(木) 22:33 David Jacot <dja...@confluent.io.invalid>:
> > >>> >
> > >>> > > Hi Luke,
> > >>> > >
> > >>> > > > Yes, on second thought, I think the new leader election is
> > >>> required to
> > >>> > > work
> > >>> > > for this new acks option. I'll think about it and open another
> KIP
> > >>> for it.
> > >>> > >
> > >>> > > It can't be in another KIP as it is required for your proposal to
> > >>> work.
> > >>> > > This is also an important part to discuss as it requires the
> > >>> controller to
> > >>> > > do more operations on leader changes.
> > >>> > >
> > >>> > > Cheers,
> > >>> > > David
> > >>> > >
> > >>> > > On Thu, May 11, 2023 at 2:44 PM Luke Chen <show...@gmail.com>
> > wrote:
> > >>> > >
> > >>> > > > Hi Ismael,
> > >>> > > > Yes, on second thought, I think the new leader election is
> > >>> required to
> > >>> > > work
> > >>> > > > for this new acks option. I'll think about it and open another
> > KIP
> > >>> for
> > >>> > > it.
> > >>> > > >
> > >>> > > > Hi Divij,
> > >>> > > > Yes, I agree with all of them. I'll think about it and let you
> > >>> know how
> > >>> > > we
> > >>> > > > can work together.
> > >>> > > >
> > >>> > > > Hi Alexandre,
> > >>> > > > > 100. The KIP makes one statement which may be considered
> > >>> critical:
> > >>> > > > "Note that in acks=min.insync.replicas case, the slow follower
> > >>> might
> > >>> > > > be easier to become out of sync than acks=all.". Would you have
> > >>> some
> > >>> > > > data on that behaviour when using the new ack semantic? It
> would
> > be
> > >>> > > > interesting to analyse and especially look at the percentage of
> > >>> time
> > >>> > > > the number of replicas in ISR is reduced to the configured
> > >>> > > > min.insync.replicas.
> > >>> > > >
> > >>> > > > The comparison data would be interesting. I can have a test
> when
> > >>> > > available.
> > >>> > > > But this KIP will be deprioritized because there should be a
> > >>> > > pre-requisite
> > >>> > > > KIP for it.
> > >>> > > >
> > >>> > > > > A (perhaps naive) hypothesis would be that the
> > >>> > > > new ack semantic indeed provides better produce latency, but at
> > the
> > >>> > > > cost of precipitating the slowest replica(s) out of the ISR?
> > >>> > > >
> > >>> > > > Yes, it could be.
> > >>> > > >
> > >>> > > > > 101. I understand the impact on produce latency, but I am not
> > >>> sure
> > >>> > > > about the impact on durability. Is your durability model built
> > >>> against
> > >>> > > > the replication factor or the number of min insync replicas?
> > >>> > > >
> > >>> > > > Yes, and also the new LEO-based leader election (not proposed
> > yet).
> > >>> > > >
> > >>> > > > > 102. Could a new type of replica which would not be allowed
> to
> > >>> enter
> > >>> > > > the ISR be an alternative? Such replica could attempt
> replication
> > >>> on a
> > >>> > > > best-effort basis and would provide the permanent guarantee not
> > to
> > >>> > > > interfere with foreground traffic.
> > >>> > > >
> > >>> > > > You mean a backup replica, which will never become leader
> > >>> (in-sync),
> > >>> > > right?
> > >>> > > > That's an interesting thought and might be able to become a
> > >>> workaround
> > >>> > > with
> > >>> > > > the existing leader election. Let me think about it.
> > >>> > > >
> > >>> > > > Hi qiangLiu,
> > >>> > > >
> > >>> > > > > It's a good point that add this config and get better P99
> > >>> latency, but
> > >>> > > is
> > >>> > > > this changing the meaning of "in sync replicas"? consider a
> > >>> situation
> > >>> > > with
> > >>> > > > "replica=3 acks=2", when two broker fail and left only the
> broker
> > >>> that
> > >>> > > > does't have the message, it is in sync, so will be elected as
> > >>> leader,
> > >>> > > will
> > >>> > > > it cause a NOT NOTICED lost of acked messages?
> > >>> > > >
> > >>> > > > Yes, it will, so the `min.insync.replicas` config in the
> > >>> broker/topic
> > >>> > > level
> > >>> > > > should be set correctly. In your example, it should be set to
> 2,
> > >>> so that
> > >>> > > > when 2 replicas down, no new message write will be successful.
> > >>> > > >
> > >>> > > >
> > >>> > > > Thank you.
> > >>> > > > Luke
> > >>> > > >
> > >>> > > >
> > >>> > > > On Thu, May 11, 2023 at 12:16 PM 67 <6...@gd67.com> wrote:
> > >>> > > >
> > >>> > > > > Hi Luke,
> > >>> > > > >
> > >>> > > > >
> > >>> > > > > It's a good point that add this config and get better P99
> > >>> latency, but
> > >>> > > is
> > >>> > > > > this changing the meaning of "in sync replicas"? consider a
> > >>> situation
> > >>> > > > with
> > >>> > > > > "replica=3 acks=2", when two broker fail and left only the
> > >>> broker that
> > >>> > > > > does't have the message, it is in sync, so will be elected as
> > >>> leader,
> > >>> > > > will
> > >>> > > > > it cause a *NOT NOTICED* lost of acked messages?
> > >>> > > > >
> > >>> > > > > qiangLiu
> > >>> > > > >
> > >>> > > > >
> > >>> > > > > 在2023年05月10 12时58分，"Ismael Juma"<ism...@juma.me.uk>写道：
> > >>> > > > >
> > >>> > > > >
> > >>> > > > > Hi Luke,
> > >>> > > > >
> > >>> > > > > As discussed in the other KIP, there are some subtleties when
> > it
> > >>> comes
> > >>> > > to
> > >>> > > > > the semantics of the system if we don't wait for all members
> of
> > >>> the isr
> > >>> > > > > before we ack. I don't understand why you say the leader
> > election
> > >>> > > > question
> > >>> > > > > is out of scope - it seems to be a core aspect to me.
> > >>> > > > >
> > >>> > > > > Ismael
> > >>> > > > >
> > >>> > > > >
> > >>> > > > > On Wed, May 10, 2023, 8:50 AM Luke Chen <show...@gmail.com>
> > >>> wrote:
> > >>> > > > >
> > >>> > > > > > Hi Ismael,
> > >>> > > > > >
> > >>> > > > > > No, I didn't know about this similar KIP! I hope I've known
> > >>> that so
> > >>> > > > that I
> > >>> > > > > > don't need to spend time to write it again! :(
> > >>> > > > > > I checked the KIP and all the discussions (here
> > >>> > > > > > <
> > >>> https://lists.apache.org/list?dev@kafka.apache.org:gte=100d:KIP-250
> > >>> > > >).
> > >>> > > > I
> > >>> > > > > > think the consensus is that adding a client config to
> > >>> `acks=quorum`
> > >>> > > is
> > >>> > > > > > fine.
> > >>> > > > > > This comment
> > >>> > > > > > <
> > >>> https://lists.apache.org/thread/p77pym5sxpn91r8j364kmmf3qp5g65rn>
> > >>> > > > from
> > >>> > > > > > Guozhang pretty much concluded what I'm trying to do.
> > >>> > > > > >
> > >>> > > > > >
> > >>> > > > > >
> > >>> > > > > >
> > >>> > > > > >
> > >>> > > > > >
> > >>> > > > > >
> > >>> > > > > >
> > >>> > > > > > *1. Add one more value to client-side acks config:   0: no
> > acks
> > >>> > > needed
> > >>> > > > at
> > >>> > > > > > all.   1: ack from the leader.   all: ack from ALL the ISR
> > >>> replicas
> > >>> > > > > >  quorum: this is the new value, it requires ack from enough
> > >>> number of
> > >>> > > > ISR
> > >>> > > > > > replicas no smaller than majority of the replicas AND no
> > >>> smaller
> > >>> > > > > > than{min.isr}.2. Clarify in the docs that if a user wants
> to
> > >>> > > tolerate X
> > >>> > > > > > failures, she needs to set client acks=all or acks=quorum
> > >>> (better
> > >>> > > tail
> > >>> > > > > > latency than "all") with broker {min.sir} to be X+1;
> however,
> > >>> "all"
> > >>> > > is
> > >>> > > > not
> > >>> > > > > > necessarily stronger than "quorum".*
> > >>> > > > > >
> > >>> > > > > > Concerns from KIP-250 are:
> > >>> > > > > > 1. Introducing a new leader LEO based election method. This
> > is
> > >>> not
> > >>> > > > clear in
> > >>> > > > > > the KIP-250 and needs more discussion
> > >>> > > > > > 2. The KIP-250 also tried to optimize the consumer latency
> to
> > >>> read
> > >>> > > > messages
> > >>> > > > > > beyond high watermark, which also has some discussion about
> > >>> how to
> > >>> > > > achieve
> > >>> > > > > > that, and no conclusion
> > >>> > > > > >
> > >>> > > > > > Both of the above 2 concerns are out of the scope of my
> > >>> current KIP.
> > >>> > > > > > So, I think it's good to provide this `acks=quorum` or
> > >>> > > > > > `acks=min.insync.replicas` option to users to give them
> > another
> > >>> > > choice.
> > >>> > > > > >
> > >>> > > > > >
> > >>> > > > > > Thank you.
> > >>> > > > > > Luke
> > >>> > > > > >
> > >>> > > > > >
> > >>> > > > > > On Wed, May 10, 2023 at 8:54 AM Ismael Juma <
> > ism...@juma.me.uk
> > >>> >
> > >>> > > wrote:
> > >>> > > > > >
> > >>> > > > > > > Hi Luke,
> > >>> > > > > > >
> > >>> > > > > > > Are you aware of
> > >>> > > > > > >
> > >>> > > > > > >
> > >>> > > > > >
> > >>> > > >
> > >>> > >
> > >>>
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-250+Add+Support+for+Quorum-based+Producer+Acknowledgment
> > >>> > > > > > > ?
> > >>> > > > > > >
> > >>> > > > > > > Ismael
> > >>> > > > > > >
> > >>> > > > > > > On Tue, May 9, 2023 at 10:14 PM Luke Chen <
> > show...@gmail.com
> > >>> >
> > >>> > > wrote:
> > >>> > > > > > >
> > >>> > > > > > > > Hi all,
> > >>> > > > > > > >
> > >>> > > > > > > > I'd like to start a discussion for the KIP-926:
> > introducing
> > >>> > > > > > > > acks=min.insync.replicas config. This KIP is to
> introduce
> > >>> > > > > > > > `acks=min.insync.replicas` config value in producer, to
> > >>> improve
> > >>> > > the
> > >>> > > > > > write
> > >>> > > > > > > > throughput and still guarantee high durability.
> > >>> > > > > > > >
> > >>> > > > > > > > Please check the link for more detail:
> > >>> > > > > > > >
> > >>> > > > > > > >
> > >>> > > > > > >
> > >>> > > > > >
> > >>> > > >
> > >>> > >
> > >>>
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-926%3A+introducing+acks%3Dmin.insync.replicas+config
> > >>> > > > > > > >
> > >>> > > > > > > > Any feedback is welcome.
> > >>> > > > > > > >
> > >>> > > > > > > > Thank you.
> > >>> > > > > > > > Luke
> > >>> > > > > > > >
> > >>> > > > > > >
> > >>> > > > > >
> > >>> > > > >
> > >>> > > > >
> > >>> > > > >
> > >>> > > > >
> > >>> > > >
> > >>> > >
> > >>> >
> > >>> >
> > >>> > --
> > >>> > ========================
> > >>> > Okada Haruki
> > >>> > ocadar...@gmail.com
> > >>> > ========================
> > >>>
> > >>
> >
>

Re: [DISCUSS] KIP-926: introducing acks=min.insync.replicas config

Reply via email to