Hi Jack,

Thanks for the comment. I have updated the reassignment part. Now the reassignment can only be completed/canceled if the final ISR size is at least the min ISR. Thanks for your efforts on the TLA+ specification! It has been a great help to the KIP!
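To make the new condition concrete, here is a minimal sketch of the check. The class and method names are hypothetical, not the actual PartitionChangeBuilder code:

import java.util.Set;

// Hypothetical sketch of the reassignment completion condition above;
// not the actual PartitionChangeBuilder implementation.
final class ReassignmentCompletion {
    // A reassignment may only complete (or be canceled) if every adding
    // replica has joined the ISR and the resulting ISR size still meets
    // min.insync.replicas.
    static boolean canComplete(Set<Integer> targetIsr,
                               Set<Integer> addingReplicas,
                               int minInsyncReplicas) {
        return targetIsr.containsAll(addingReplicas)
                && targetIsr.size() >= minInsyncReplicas;
    }
}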
On Wed, Sep 6, 2023 at 6:32 AM Jack Vanlightly <vanligh...@apache.org> wrote:
> Hi Calvin,
>
> Regarding partition reassignment, I have two comments.
>
> I notice the KIP says "The AlterPartitionReassignments will not change the ELR"; however, when a reassignment completes (or reverts), any replicas removed from the replica set would be removed from the ELR. Sounds obvious, but I figured we should be explicit about that.
>
> Reassignment should also respect min.insync.replicas, because currently a reassignment can complete as long as the ISR is not empty and all added replicas are members. However, my TLA+ specification, which now includes reassignment, finds single-broker failures that can cause committed data loss, despite the added protection of the ELR and min.insync.replicas=2. These scenarios are limited to shrinking the size of the replica set. If we modify the PartitionChangeBuilder to add the completion condition that the target ISR size >= min.insync.replicas, then that closes this last single-broker-failure data loss case.
>
> With the above modification, the TLA+ specification of the ELR part of the design is standing up to all safety and liveness checks. The only thing that is not modeled is the unclean recovery, though I may leave that out as the specification is already very large.
>
> Jack
>
> On 2023/09/01 22:27:10 Calvin Liu wrote:
> > Hi Justine
> >
> > 1. With the new proposal, in order to let the consumer consume a message when only 1 replica commits it to its log, the min ISR has to be set to 1.
> >
> > 2. Yes, we will need an unclean recovery if the leader has an unclean shutdown.
> >
> > 3. If the min ISR config is changed to a larger value, the ISR and ELR will not be updated. ISR members are always valid no matter how min ISR changes. If the ELR is not empty, then the HWM can't advance either after the min ISR increase, so the ELR members are safe to stay.
> >
> > 4. I will highlight the explanation. Thanks.
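To make the rule behind point 3 above concrete, here is a minimal sketch of the proposed HWM advancement condition. The names are hypothetical, not the actual Kafka leader code:

import java.util.Collection;

// Hypothetical sketch of the proposed rule: the HWM may only advance when
// a) every controller-committed ISR member has replicated up to the
// proposed HWM, and b) the ISR has at least min.insync.replicas members.
final class HighWatermark {
    static boolean canAdvance(Collection<Long> isrLogEndOffsets,
                              long proposedHwm,
                              int minInsyncReplicas) {
        boolean fullyReplicated = isrLogEndOffsets.stream()
                .allMatch(leo -> leo >= proposedHwm);
        return fullyReplicated
                && isrLogEndOffsets.size() >= minInsyncReplicas;
    }
}

Under this rule, a min ISR increase that leaves the ISR below the new value blocks further HWM advancement, which is why the ELR members are safe to stay.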
> > On Thu, Aug 31, 2023 at 4:35 PM Justine Olshan <jols...@confluent.io.invalid> wrote:
> > > Hey Calvin,
> > >
> > > Thanks for the responses. I think I understood most of it, but had a few follow-up questions.
> > >
> > > 1. For the acks=1 case, I was wondering if there is any way to continue with the current behavior (i.e., we only need one ack to produce to the log and consider the request complete). My understanding is that we can also consume from such topics at that point. If users wanted this lower durability, could they set min.insync.replicas to 1?
> > >
> > > 2. For the case where we elect a leader that was unknowingly offline: say this replica was the only one in the ELR. My understanding is that we would promote it to ISR and remove it from ELR when it is the leader, but then we would remove it from ISR and have no brokers in ISR or ELR. From there we would need to do unclean recovery, right?
> > >
> > > 3. Did we address the case where min ISR is dynamically increased?
> > >
> > > 4. I think my comment was more about confusion on the KIP. It was not clear to me that the section was describing points if one was done before the other. But I now see the sentence explaining that. I think I skipped from "delivery plan" to the bullet points.
> > >
> > > Justine
> > >
> > > On Thu, Aug 31, 2023 at 4:04 PM Calvin Liu <ca...@confluent.io.invalid> wrote:
> > > > Hi Justine
> > > > Thanks for the questions!
> > > >
> > > > a. *For my understanding, will we block replication? Or just the high watermark advancement?*
> > > > - The replication will not be blocked. The followers are free to replicate messages above the HWM. Only HWM advancement is blocked.
> > > >
> > > > b. *Also in the acks=1 case, if folks want to continue the previous behavior, they also need to set min.insync.replicas to 1, correct?*
> > > > - If the clients only send ack=1 messages and minISR=2, the HWM behavior will only be different when there is 1 replica in the ISR. In that case, the min ISR does not do much in the current system. It is kind of a trade-off, but we think it is ok.
> > > >
> > > > c. *The KIP seems to suggest that we remove from ELR when we start up again and notice we do not have the clean shutdown file. Is there a chance we have an offline broker in ELR that had an unclean shutdown that we elect as a leader before we get the chance to realize the shutdown was unclean?*
> > > > - The controller will only elect an unfenced (online) replica as the leader. If a broker has an unclean shutdown, it should register with the controller first (where it has to declare whether it had a clean or unclean shutdown) and then start to serve broker requests. So:
> > > > 1. If the broker has an unclean shutdown before the controller is aware that the replica is offline, then the broker can become the leader temporarily. But it can't serve any Fetch requests before it registers again, and that's when the controller will re-elect a leader.
> > > > 2. If the controller knows the replica is offline (missing heartbeats from the broker for a while) before the broker re-registers, the broker can't be elected as a leader.
> > > >
> > > > d. *Would this be the case for strictly a smaller min ISR?*
> > > > - Yes, only when we have a smaller min ISR. Once the leader is aware of the minISR change, the HWM can advance and make the current ELR obsolete. So the controller should clear the ELR if the ISR >= the new min ISR.
> > > >
> > > > e. *I thought we said the above "Last Leader" behavior can't be maintained with an empty ISR and it should be removed.*
> > > > - As the KIP is a big one, we have to consider delivering it in phases. If only the Unclean Recovery is delivered, we do not touch the ISR, so the ISR behavior will be the same as it is currently. I am open to the proposal of directly starting unclean recovery if the last leader fails. Let's see if other folks hope to have more if Unclean Recovery delivers first.
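A small sketch of the controller-side handling described in (d), with hypothetical names rather than the actual controller code:

import java.util.Set;

// Hypothetical sketch of handling a min.insync.replicas decrease, per (d):
// once the ISR alone satisfies the new, smaller min ISR, the HWM can
// advance again, so the current ELR members are obsolete and are cleared.
final class MinIsrUpdate {
    static Set<Integer> elrAfterMinIsrChange(Set<Integer> isr,
                                             Set<Integer> elr,
                                             int newMinIsr) {
        return isr.size() >= newMinIsr ? Set.of() : elr;
    }
}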
> > > > On Tue, Aug 29, 2023 at 4:53 PM Justine Olshan <jols...@confluent.io.invalid> wrote:
> > > > > Hey Calvin,
> > > > >
> > > > > Thanks for the KIP. This will close some of the gaps in leader election! I have a few questions:
> > > > >
> > > > > *> High Watermark can only advance if the ISR size is larger than or equal to min.insync.replicas.*
> > > > >
> > > > > For my understanding, will we block replication? Or just the high watermark advancement? Also in the acks=1 case, if folks want to continue the previous behavior, they also need to set min.insync.replicas to 1, correct? It seems like this change takes some control away from clients when it comes to durability vs. availability.
> > > > >
> > > > > *> ELR + ISR size will not be dropped below the min ISR unless the controller discovers an ELR member has an unclean shutdown.*
> > > > >
> > > > > The KIP seems to suggest that we remove from ELR when we start up again and notice we do not have the clean shutdown file. Is there a chance we have an offline broker in ELR that had an unclean shutdown that we elect as a leader before we get the chance to realize the shutdown was unclean? This seems like it could cause some problems. I may have missed how we avoid this scenario though.
> > > > >
> > > > > *> When updating the config min.insync.replicas, if the new min ISR <= current ISR, the ELR will be removed.*
> > > > >
> > > > > Would this be the case for strictly a smaller min ISR? I suppose if we increase the ISR, we can't reason about ELR. Can we reason about the high watermark in this case? It seems we would have a broker out of the ISR that is in neither the ISR nor the ELR. (Forgive me if we can't increase min ISR if the increase will put us under it.)
> > > > >
> > > > > *> Unclean recovery.*
> > > > > - *The unclean leader election will be replaced by the unclean recovery.*
> > > > > - *unclean.leader.election.enable will only be replaced by the unclean.recovery.strategy after ELR is delivered.*
> > > > > - *As there is no change to the ISR, the "last known leader" behavior is maintained.*
> > > > >
> > > > > What does "last known leader behavior maintained" mean here? I thought we said *"The above "Last Leader" behavior can't be maintained with an empty ISR and it should be removed."* My understanding is that once the metadata version is updated, we will always take the more thoughtful unclean election process (i.e., inspect the logs).
> > > > >
> > > > > Overall though, the general KIP is pretty solid. Looking at the rejected alternatives, it looks like a lot was considered, so it's nice to see the final proposal.
> > > > >
> > > > > Justine
> > > > >
> > > > > On Mon, Aug 14, 2023 at 8:50 AM Calvin Liu <ca...@confluent.io.invalid> wrote:
> > > > > > 1. Yes, the new protocol requires 2 things to advance the HWM: a) the messages have been replicated to the controller-committed ISR members, and b) the number of ISR members should be at least the min ISR.
> > > > > >
> > > > > > 2. With the current protocol, we are not able to select broker 1 as the leader. If we first assume we have the new HWM requirement in place, then broker 1 is a good candidate to choose. The following (ELR) part of the KIP will explain a new mechanism that enables us to choose broker 1. Note, if both HWM and ELR are in place, broker 1 will actually be elected in T3.
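To make the election order implied above concrete, here is a hypothetical sketch (not the actual controller code): prefer an unfenced ISR member, then an unfenced ELR member, and only fall back to unclean recovery when both are exhausted.

import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

// Hypothetical sketch of the leader election preference implied by the
// thread; not the actual controller implementation.
final class LeaderElection {
    static Optional<Integer> electLeader(List<Integer> isr,
                                         List<Integer> elr,
                                         Predicate<Integer> isUnfenced) {
        // Prefer an online ISR member; otherwise try an online ELR member.
        // An empty result here would be the trigger for unclean recovery.
        return isr.stream().filter(isUnfenced).findFirst()
                .or(() -> elr.stream().filter(isUnfenced).findFirst());
    }
}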
> > > > > > On Fri, Aug 11, 2023 at 10:05 AM Jeff Kim <jeff....@confluent.io.invalid> wrote:
> > > > > > > Hi Calvin,
> > > > > > >
> > > > > > > Thanks for the KIP! I'm still digesting it, but I have two questions:
> > > > > > >
> > > > > > > > In the scenario raised in the motivation section, the server may receive ack=1 messages during T1 and advance High Watermark when the leader is the only one in ISR.
> > > > > > >
> > > > > > > To confirm, the current protocol allows advancing the HWM if all brokers in the ISR append to their logs (in this case only the leader). And we're proposing to advance the HWM only when <at least min.insync.replicas> brokers replicate. Is this correct?
> > > > > > >
> > > > > > > > Then, if we elect broker 1 as the leader at T4, though we can guarantee the safety of ack=all messages, the High Watermark may move backward, which causes further impacts on the consumers.
> > > > > > >
> > > > > > > How can broker 1 become the leader if it was ineligible in T3? Or are you referring to broker 2?
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Jeff
> > > > > > >
> > > > > > > On Thu, Aug 10, 2023 at 6:48 PM Calvin Liu <ca...@confluent.io.invalid> wrote:
> > > > > > > > Hi everyone,
> > > > > > > >
> > > > > > > > I'd like to discuss a series of enhancements to the replication protocol.
> > > > > > > >
> > > > > > > > A partition replica can experience local data loss in unclean shutdown scenarios where unflushed data in the OS page cache is lost, such as during an availability zone power outage or a server error. The Kafka replication protocol is designed to handle these situations by removing such replicas from the ISR and only re-adding them once they have caught up and therefore recovered any lost data. This prevents replicas that lost an arbitrary log suffix, which included committed data, from being elected leader.
> > > > > > > >
> > > > > > > > However, there is a "last replica standing" state which, when combined with a data loss unclean shutdown event, can turn a local data loss scenario into a global data loss scenario, i.e., committed data can be removed from all replicas. When the last replica in the ISR experiences an unclean shutdown and loses committed data, it will be reelected leader after starting up again, causing rejoining followers to truncate their logs and thereby removing the last copies of the committed records which the leader lost initially.
> > > > > > > >
> > > > > > > > The new KIP maximizes the protection and provides MinISR-1 tolerance to data loss unclean shutdown events.
> > > > > > > >
> > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-966%3A+Eligible+Leader+Replicas