Hi Calvin,

Regarding partition reassignment, I have two comments.

I notice the KIP says "The AlterPartitionReassignments will not change the
ELR"; however, when a reassignment completes (or reverts), any replicas
removed from the replica set would be removed from the ELR. Sounds obvious,
but I figured we should be explicit about that.

Reassignment should also respect min.insync.replicas, because currently a
reassignment can complete as long as the ISR is not empty and all added
replicas are members. However, my TLA+ specification, which now includes
reassignment, finds single-broker failures that can cause committed data
loss, despite the added protection of the ELR and min.insync.replicas=2.
These scenarios are limited to shrinking the size of the replica set. If we
modify the PartitionChangeBuilder to add the completion condition that the
target ISR >= min.insync.replicas, then that closes this last
single-broker-failure data loss case (see the sketch below).

With the above modification, the TLA+ specification of the ELR part of the
design is standing up to all safety and liveness checks. The only thing that
is not modeled is the unclean recovery, though I may leave that out, as the
specification is already very large.

Jack
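A minimal sketch of that completion condition, using hypothetical names
rather than the real PartitionChangeBuilder internals:

    import java.util.List;
    import java.util.Set;

    public class ReassignmentCompletionCheck {

        // Today a reassignment can complete once the ISR is non-empty and
        // all adding replicas have joined it. The proposed extra condition
        // also requires the target ISR to be at least min.insync.replicas,
        // which closes the single-broker-failure data loss case when the
        // reassignment shrinks the replica set.
        static boolean canComplete(Set<Integer> targetIsr,
                                   List<Integer> addingReplicas,
                                   int minInsyncReplicas) {
            return !targetIsr.isEmpty()
                && targetIsr.containsAll(addingReplicas)
                && targetIsr.size() >= minInsyncReplicas; // proposed new condition
        }

        public static void main(String[] args) {
            // Shrinking to a single in-sync replica with minISR=2: the
            // current rule would allow completion, the proposed rule blocks it.
            System.out.println(canComplete(Set.of(1), List.of(), 2));     // false
            System.out.println(canComplete(Set.of(1, 2), List.of(2), 2)); // true
        }
    }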
On 2023/09/01 22:27:10 Calvin Liu wrote:
> Hi Justine
> 1. With the new proposal, in order to let the consumer consume a message
> when only 1 replica commits it to its log, the min ISR has to be set to 1.
> 2. Yes, we will need an unclean recovery if the leader has an unclean
> shutdown.
> 3. If the min ISR config is changed to a larger value, the ISR and ELR will
> not be updated. ISR members are always valid no matter how min ISR changes.
> If the ELR is not empty, then the HWM can't advance either after the min
> ISR increase, so the ELR members are safe to stay.
> 4. I will highlight the explanation. Thanks.
>
> On Thu, Aug 31, 2023 at 4:35 PM Justine Olshan
> <jols...@confluent.io.invalid> wrote:
> > Hey Calvin,
> >
> > Thanks for the responses. I think I understood most of it, but had a few
> > follow-up questions.
> >
> > 1. For the acks=1 case, I was wondering if there is any way to continue
> > with the current behavior (i.e., we only need one ack to produce to the
> > log and consider the request complete). My understanding is that we can
> > also consume from such topics at that point. If users wanted this lower
> > durability, could they set min.insync.replicas to 1?
> >
> > 2. For the case where we elect a leader that was unknowingly offline:
> > say this replica was the only one in the ELR. My understanding is that
> > we would promote it to the ISR and remove it from the ELR when it is the
> > leader, but then we would remove it from the ISR and have no brokers in
> > the ISR or ELR. From there we would need to do unclean recovery, right?
> >
> > 3. Did we address the case where the min ISR is dynamically increased?
> >
> > 4. I think my comment was more about confusion on the KIP. It was not
> > clear to me that the section was describing points if one was done
> > before the other. But I now see the sentence explaining that. I think I
> > skipped from "delivery plan" to the bullet points.
> >
> > Justine
> >
> > On Thu, Aug 31, 2023 at 4:04 PM Calvin Liu <ca...@confluent.io.invalid>
> > wrote:
> > > Hi Justine,
> > > Thanks for the questions!
> > >
> > > a. *For my understanding, will we block replication? Or just the high
> > > watermark advancement?*
> > > - The replication will not be blocked. The followers are free to
> > > replicate messages above the HWM. Only HWM advancement is blocked.
> > >
> > > b. *Also in the acks=1 case, if folks want to continue the previous
> > > behavior, they also need to set min.insync.replicas to 1, correct?*
> > > - If the clients only send acks=1 messages and minISR=2, the HWM
> > > behavior will only be different when there is 1 replica in the ISR. In
> > > that case, the min ISR does not do much in the current system. It is
> > > kind of a trade-off, but we think it is ok.
> > >
> > > c. *The KIP seems to suggest that we remove from the ELR when we start
> > > up again and notice we do not have the clean shutdown file. Is there a
> > > chance we have an offline broker in the ELR that had an unclean
> > > shutdown that we elect as a leader before we get the chance to realize
> > > the shutdown was unclean?*
> > > - The controller will only elect an unfenced (online) replica as the
> > > leader. If a broker has an unclean shutdown, it has to register with
> > > the controller first (where it has to declare whether it had a clean
> > > or unclean shutdown) and then start to serve broker requests. So:
> > > 1. If the broker has an unclean shutdown before the controller is
> > > aware that the replica is offline, then the broker can become the
> > > leader temporarily. But it can't serve any Fetch requests before it
> > > registers again, and that's when the controller will re-elect a leader.
> > > 2. If the controller knows the replica is offline (missing heartbeats
> > > from the broker for a while) before the broker re-registers, the
> > > broker can't be elected as a leader.
> > >
> > > d. *Would this be the case for strictly a smaller min ISR?*
> > > - Yes, only when we have a smaller min ISR. Once the leader is aware
> > > of the minISR change, the HWM can advance and make the current ELR
> > > obsolete, so the controller should clear the ELR if the ISR >= the new
> > > min ISR.
> > >
> > > e. *I thought we said the above "Last Leader" behavior can't be
> > > maintained with an empty ISR and it should be removed.*
> > > - As the KIP is a big one, we have to consider delivering it in
> > > phases. If only the Unclean Recovery is delivered, we do not touch the
> > > ISR, so the ISR behavior will be the same as today. I am open to the
> > > proposal of directly starting unclean recovery if the last leader
> > > fails. Let's see if other folks hope to have more if Unclean Recovery
> > > is delivered first.
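As a rough illustration of points 3 and (d) above, the controller-side rule
amounts to the following sketch (the names are made up, not the controller's
actual API):

    import java.util.Set;

    public class ElrOnMinIsrChange {

        // If the current ISR already satisfies the new (smaller) min ISR,
        // the HWM can advance again, making the current ELR obsolete, so
        // it is cleared. If min ISR was raised instead, ISR members stay
        // valid and a non-empty ELR keeps the HWM from advancing, so the
        // ELR members are safe to stay.
        static Set<Integer> elrAfterMinIsrChange(Set<Integer> isr,
                                                 Set<Integer> elr,
                                                 int newMinIsr) {
            if (isr.size() >= newMinIsr) {
                return Set.of(); // ELR obsolete once the HWM can advance again
            }
            return elr; // ELR kept unchanged
        }

        public static void main(String[] args) {
            // min ISR lowered to 1 with ISR={1}: ELR {2} is cleared.
            System.out.println(elrAfterMinIsrChange(Set.of(1), Set.of(2), 1)); // []
            // min ISR raised to 3: ELR is kept.
            System.out.println(elrAfterMinIsrChange(Set.of(1), Set.of(2), 3)); // [2]
        }
    }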
> > > On Tue, Aug 29, 2023 at 4:53 PM Justine Olshan
> > > <jols...@confluent.io.invalid> wrote:
> > > > Hey Calvin,
> > > >
> > > > Thanks for the KIP. This will close some of the gaps in leader
> > > > election! I had a few questions:
> > > >
> > > > *> High Watermark can only advance if the ISR size is larger or
> > > > equal to min.insync.replicas.*
> > > > For my understanding, will we block replication? Or just the high
> > > > watermark advancement?
> > > > Also in the acks=1 case, if folks want to continue the previous
> > > > behavior, they also need to set min.insync.replicas to 1, correct?
> > > > It seems like this change takes some control away from clients when
> > > > it comes to durability vs availability.
> > > >
> > > > *> ELR + ISR size will not be dropped below the min ISR unless the
> > > > controller discovers an ELR member has an unclean shutdown.*
> > > > The KIP seems to suggest that we remove from the ELR when we start
> > > > up again and notice we do not have the clean shutdown file. Is there
> > > > a chance we have an offline broker in the ELR that had an unclean
> > > > shutdown that we elect as a leader before we get the chance to
> > > > realize the shutdown was unclean? This seems like it could cause
> > > > some problems. I may have missed how we avoid this scenario though.
> > > >
> > > > *> When updating the config min.insync.replicas, if the new min ISR
> > > > <= current ISR, the ELR will be removed.*
> > > > Would this be the case for strictly a smaller min ISR? I suppose if
> > > > we increase the min ISR, we can't reason about the ELR. Can we
> > > > reason about the high watermark in this case? It seems like we will
> > > > have the broker out of the ISR in neither the ISR nor the ELR.
> > > > (Forgive me if we can't increase the min ISR if the increase will
> > > > put us under it.)
> > > >
> > > > *> Unclean recovery.*
> > > > - *The unclean leader election will be replaced by the unclean
> > > > recovery.*
> > > > - *unclean.leader.election.enable will only be replaced by the
> > > > unclean.recovery.strategy after ELR is delivered.*
> > > > - *As there is no change to the ISR, the "last known leader"
> > > > behavior is maintained.*
> > > > What does "last known leader behavior is maintained" mean here? I
> > > > thought we said *"The above "Last Leader" behavior can't be
> > > > maintained with an empty ISR and it should be removed."* My
> > > > understanding is that once the metadata version is updated, we will
> > > > always take the more thoughtful unclean election process (i.e.,
> > > > inspect the logs).
> > > >
> > > > Overall though, the general KIP is pretty solid. Looking at the
> > > > rejected alternatives, it looks like a lot was considered, so it's
> > > > nice to see the final proposal.
> > > >
> > > > Justine
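The offline-broker concern above is what answer (c) addresses: only unfenced
replicas are electable, and an uncleanly restarted broker must re-register
(dropping out of the ELR) before it can serve fetches. A sketch of that
eligibility rule, assuming hypothetical names and that ISR members are
preferred over ELR members (not the actual QuorumController code):

    import java.util.List;
    import java.util.Optional;
    import java.util.Set;
    import java.util.function.Predicate;

    public class LeaderEligibility {

        // The controller only elects unfenced (online) replicas, picking
        // from the ISR first and falling back to the ELR. A broker that had
        // an unclean shutdown must re-register before serving fetches, and
        // on that registration the controller removes it from the ELR and
        // re-elects, so it cannot quietly lead with a truncated log.
        static Optional<Integer> electLeader(List<Integer> isr,
                                             List<Integer> elr,
                                             Set<Integer> unfencedBrokers) {
            Predicate<Integer> online = unfencedBrokers::contains;
            return isr.stream().filter(online).findFirst()
                      .or(() -> elr.stream().filter(online).findFirst());
        }

        public static void main(String[] args) {
            // Empty ISR and the only ELR member (2) is fenced/offline:
            // no eligible leader, so unclean recovery would be needed.
            System.out.println(electLeader(List.of(), List.of(2), Set.of(1, 3))); // Optional.empty
            System.out.println(electLeader(List.of(), List.of(2), Set.of(2)));    // Optional[2]
        }
    }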
> > > > On Mon, Aug 14, 2023 at 8:50 AM Calvin Liu
> > > > <ca...@confluent.io.invalid> wrote:
> > > > > 1. Yes, the new protocol requires 2 things to advance the HWM:
> > > > > a) the messages have been replicated to the controller-committed
> > > > > ISR members, and b) the number of ISR members should be at least
> > > > > the min ISR.
> > > > > 2. With the current protocol, we are not able to select broker 1
> > > > > as the leader. If we first assume we have the new HWM requirement
> > > > > in place, then broker 1 is a good candidate to choose. The
> > > > > following part of the KIP (the ELR part) will explain a new
> > > > > mechanism that enables us to choose broker 1. Note, if both the
> > > > > HWM and ELR changes are in place, broker 1 will actually be
> > > > > elected in T3.
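Point 1 above, sketched in code with illustrative names only (not Kafka's
actual replica-manager logic):

    import java.util.Map;

    public class HwmAdvancement {

        // The HWM may advance to candidateOffset only if (a) every
        // controller-committed ISR member has replicated up to it, and
        // (b) the ISR size is at least min.insync.replicas. Followers can
        // still fetch records above the HWM; only its advancement (and
        // thus acks=all completion and consumer visibility) is blocked.
        static boolean canAdvance(Map<Integer, Long> isrLogEndOffsets,
                                  long candidateOffset,
                                  int minInsyncReplicas) {
            return isrLogEndOffsets.size() >= minInsyncReplicas    // condition (b), new
                && isrLogEndOffsets.values().stream()
                       .allMatch(leo -> leo >= candidateOffset);   // condition (a), existing
        }

        public static void main(String[] args) {
            // A lone leader in the ISR with minISR=2 keeps replicating but
            // cannot advance the HWM.
            System.out.println(canAdvance(Map.of(1, 100L), 100L, 2));          // false
            System.out.println(canAdvance(Map.of(1, 100L, 2, 100L), 100L, 2)); // true
        }
    }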
> > > > > On Fri, Aug 11, 2023 at 10:05 AM Jeff Kim
> > > > > <jeff....@confluent.io.invalid> wrote:
> > > > > > Hi Calvin,
> > > > > >
> > > > > > Thanks for the KIP! I'm still digesting it, but I have two
> > > > > > questions:
> > > > > >
> > > > > > > In the scenario raised in the motivation section, the server
> > > > > > > may receive ack=1 messages during T1 and advance the High
> > > > > > > Watermark when the leader is the only one in the ISR.
> > > > > > To confirm, the current protocol allows advancing the HWM if all
> > > > > > brokers in the ISR append to their logs (in this case only the
> > > > > > leader). And we're proposing to advance the HWM only when at
> > > > > > least min.insync.replicas brokers replicate. Is this correct?
> > > > > >
> > > > > > > Then, if we elect broker 1 as the leader at T4, though we can
> > > > > > > guarantee the safety of ack=all messages, the High Watermark
> > > > > > > may move backward, which causes further impacts on the
> > > > > > > consumers.
> > > > > > How can broker 1 become the leader if it was ineligible in T3?
> > > > > > Or are you referring to broker 2?
> > > > > >
> > > > > > Thanks,
> > > > > > Jeff
> > > > > >
> > > > > > On Thu, Aug 10, 2023 at 6:48 PM Calvin Liu
> > > > > > <ca...@confluent.io.invalid> wrote:
> > > > > > > Hi everyone,
> > > > > > > I'd like to discuss a series of enhancements to the
> > > > > > > replication protocol.
> > > > > > >
> > > > > > > A partition replica can experience local data loss in unclean
> > > > > > > shutdown scenarios where unflushed data in the OS page cache
> > > > > > > is lost, such as during an availability zone power outage or a
> > > > > > > server error. The Kafka replication protocol is designed to
> > > > > > > handle these situations by removing such replicas from the ISR
> > > > > > > and only re-adding them once they have caught up and therefore
> > > > > > > recovered any lost data. This prevents replicas that lost an
> > > > > > > arbitrary log suffix, which included committed data, from
> > > > > > > being elected leader.
> > > > > > > However, there is a "last replica standing" state which, when
> > > > > > > combined with a data-loss unclean shutdown event, can turn a
> > > > > > > local data loss scenario into a global data loss scenario,
> > > > > > > i.e., committed data can be removed from all replicas. When
> > > > > > > the last replica in the ISR experiences an unclean shutdown
> > > > > > > and loses committed data, it will be re-elected leader after
> > > > > > > starting up again, causing rejoining followers to truncate
> > > > > > > their logs and thereby remove the last copies of the committed
> > > > > > > records which the leader lost initially.
> > > > > > >
> > > > > > > The new KIP will maximize the protection and provide minISR-1
> > > > > > > tolerance to data loss unclean shutdown events.
> > > > > > >
> > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-966%3A+Eligible+Leader+Replicas