Hi Martin,
Thanks for the report; this flapping behavior sounds like something we
should address. I opened a ticket to track progress:
https://issues.apache.org/jira/browse/KAFKA-19996.
Thanks again!

On Fri, Nov 28, 2025 at 2:04 AM Martin Dickson via dev <[email protected]>
wrote:

> Hi all,
>
> I wanted to check the expected behaviour of this KIP in the following case.
> Assume you have a partition of a topic with minISR=2, RF=3. The leader is
> receiving a lot of traffic on that partition, so much that followers can't
> keep up. One follower is booted out of ISR as normal, but currently the
> second follower flaps in and out of ISR. This is because ISR expand /
> shrink perform different checks:
>
> - Shrink ISR, attempted on a schedule, triggers if the follower's LEO has
> not recently caught up with the leader's LEO (see `isCaughtUp(...)` in
> `Replica.scala`)
> - Expand ISR, attempted when processing a follower fetch, triggers if the
> follower's LEO has caught up with the leader's HWM in this fetch (via
> `isFollowerInSync(...)` in `Partition.scala`)
>
> and both of these will always trigger: shrink because the follower's LEO
> is far behind the leader's (it simply has far less data), but also expand
> because in this case the leader's HWM is equal to the follower's LEO.
> Expand triggering is new behaviour with KIP-966: the `isFollowerInSync(...)`
> code itself is unchanged, but the behaviour is different because of the
> change in HWM advancement logic. This causes the flapping, which means load
> on the control plane (for each such partition we trigger two requests in
> every iteration of the "isr-expiration" schedule). A rough sketch of the
> two checks follows.
>
> We noticed this with a producer using acks=1 on a minISR=2 topic (which,
> FWIW, we consider a misconfiguration). Due to acks=1 we are able to pile up
> a large amount of data on the leader. The HWM is advancing slowly (dictated
> by the follower's speed), but arguably the HWM should not be advancing at
> all because the follower is lagging significantly. With acks=all I believe
> the behaviour would be different: we'd backpressure the producer (either
> via the replication speed or via the partition becoming read-only) and
> wouldn't be able to accumulate so much unreplicated data on the leader.
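>
> For reference, the producer setting in question is just the acks config; a
> minimal sketch (hypothetical broker address and topic name, not our actual
> setup):
>
>   import java.util.Properties
>   import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
>
>   object AcksOneSketch {
>     def main(args: Array[String]): Unit = {
>       val props = new Properties()
>       props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
>       props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
>         "org.apache.kafka.common.serialization.StringSerializer")
>       props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
>         "org.apache.kafka.common.serialization.StringSerializer")
>       // acks=1: the leader acks once it has appended locally, so unreplicated
>       // data can pile up on it; acks=all would instead wait for
>       // min.insync.replicas and backpressure the producer.
>       props.put(ProducerConfig.ACKS_CONFIG, "1")
>       val producer = new KafkaProducer[String, String](props)
>       producer.send(new ProducerRecord[String, String]("some-topic", "key", "value"))
>       producer.close()
>     }
>   }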
>
> An alternative would be to change the implementation of "expand" to also
> use the leader's LEO. This would be more in line with the
> replica.lag.max.time.ms documentation. I can imagine we don't want to
> over-index on the behaviour of acks=1/minISR=2 examples, but on the other
> hand I'm not immediately seeing why shrink and expand should perform
> different checks in general.
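>
> Concretely, the change I have in mind for the expand side is along these
> lines (illustrative sketch only, with made-up names, not a patch against
> the actual `Partition.scala` code):
>
>   // Current expand check (sketch): compare follower LEO against leader HWM.
>   def isFollowerInSyncViaHwm(followerLeo: Long, leaderHwm: Long): Boolean =
>     followerLeo >= leaderHwm
>
>   // Alternative (sketch): compare against the leader's LEO at fetch time,
>   // mirroring the shrink check and the replica.lag.max.time.ms documentation.
>   def isFollowerInSyncViaLeaderLeo(followerLeo: Long, leaderLeo: Long): Boolean =
>     followerLeo >= leaderLeo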
>
> Thanks,
> Martin
>
> On 2023/08/10 22:46:52 Calvin Liu wrote:
> > Hi everyone,
> > I'd like to discuss a series of enhancements to the replication protocol.
> >
> > A partition replica can experience local data loss in unclean shutdown
> > scenarios where unflushed data in the OS page cache is lost - such as an
> > availability zone power outage or a server error. The Kafka replication
> > protocol is designed to handle these situations by removing such replicas
> > from the ISR and only re-adding them once they have caught up and
> > therefore recovered any lost data. This prevents replicas that lost an
> > arbitrary log suffix, which included committed data, from being elected
> > leader.
> > However, there is a "last replica standing" state which, when combined
> > with a data loss unclean shutdown event, can turn a local data loss
> > scenario into a global data loss scenario, i.e., committed data can be
> > removed from all replicas. When the last replica in the ISR experiences an
> > unclean shutdown and loses committed data, it will be reelected leader
> > after starting up again, causing rejoining followers to truncate their
> > logs and thereby removing the last copies of the committed records which
> > the leader lost initially.
> >
> > The new KIP will maximize the protection and provide MinISR-1 tolerance to
> > data loss unclean shutdown events.
> >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-966%3A+Eligible+Leader+Replicas
> >
>
