Re: [DISCUSS] KIP-1303: Deprioritize Tiered Storage Followers In Leader Election

Thomas Thornton via dev Thu, 28 May 2026 16:04:14 -0700

Hi Manan,

Thanks for the feedback. We've updated the KIP to reflect the
time/bytes-based approach. Please let us know your thoughts on the new
proposal. Thank you again for the discussion on this.


https://cwiki.apache.org/confluence/display/KAFKA/KIP-1303%3A+Deprioritize+Tiered+Storage+Followers+In+Leader+Election
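
For illustration only, here is a rough sketch of how the time/bytes rule described in the quoted message below could order replicas. It is not the KIP's implementation. The names `ReplicaLocalLog`, `is_eligible`, and `election_order` are hypothetical, and two behaviors are assumptions: that meeting either enabled threshold is enough when both are set, and that replicas below the threshold are moved behind eligible ones by a stable sort rather than excluded.

```python
from dataclasses import dataclass

DISABLED = -1  # mirrors the proposed default of -1 (disabled)


@dataclass(frozen=True)
class ReplicaLocalLog:
    """Hypothetical summary of one replica's local (non-remote) log."""
    broker_id: int
    local_span_ms: int      # time range covered by local segments
    local_size_bytes: int   # total size of local segments


def is_eligible(replica, eligible_ms=DISABLED, eligible_bytes=DISABLED):
    """Assumed semantics: if both thresholds are disabled, every replica is
    eligible; otherwise meeting any enabled threshold is sufficient."""
    ms_on = eligible_ms > 0
    bytes_on = eligible_bytes > 0
    if not ms_on and not bytes_on:
        return True
    return (ms_on and replica.local_span_ms >= eligible_ms) or (
        bytes_on and replica.local_size_bytes >= eligible_bytes)


def election_order(assignment, eligible_ms=DISABLED, eligible_bytes=DISABLED):
    """Stable sort: eligible replicas first, keeping the original
    assignment order inside each group (assumed deprioritization rule)."""
    return sorted(
        assignment,
        key=lambda r: not is_eligible(r, eligible_ms, eligible_bytes))


# Usage: broker 1 was just bootstrapped and holds only 30 seconds of local data.
replicas = [
    ReplicaLocalLog(1, local_span_ms=30_000, local_size_bytes=5 * 1024 * 1024),
    ReplicaLocalLog(2, local_span_ms=3_600_000, local_size_bytes=900 * 1024 * 1024),
    ReplicaLocalLog(3, local_span_ms=1_200_000, local_size_bytes=300 * 1024 * 1024),
]
order = [r.broker_id for r in election_order(replicas, eligible_ms=600_000)]
```

With a 10 minute threshold, broker 1 moves behind brokers 2 and 3 until its local data spans 10 minutes, after which the original assignment order applies again. With both thresholds left at -1 the order is unchanged.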

Best,
Tom

On Wed, May 13, 2026 at 12:18 PM Thomas Thornton
<[email protected]> wrote:
>
> Hi Manan,
>
> Thanks for the feedback & discussion on this.
>
> We are thinking a time/bytes-based approach will work best here. The
> issue with the offset-based approach is that it doesn't scale well
> across different topic throughput rates. For high-throughput topics,
> the offset gap between existing and newly-bootstrapped replicas can be
> millions of messages, and convergence time approaches
> local.retention.ms (potentially hours), since the gap only closes as
> the old replica's local retention truncates segments, until its
> localLogStartOffset falls within the acceptable offset range of the
> newer replica's. This means hours of rack imbalance
> even when all consumers are caught up, which is a worse operational
> state than the brief remote fetches this feature tries to prevent.
>
> A time/bytes threshold directly bounds this: operators set a value
> (e.g., 10 minutes, 100MB), and once a replica's local data spans that
> threshold (duration or size), it becomes eligible for leadership. This
> is intuitive, throughput-independent, and most broadly applicable
> across varied workloads.
>
> On operator confusion with local.retention.ms: we'd name these
> distinctly under a leader election namespace, e.g.:
>
> - leader.election.eligible.local.log.ms (default: -1, disabled)
> - leader.election.eligible.local.log.bytes (default: -1, disabled)
>
> These clearly describe leader election eligibility, not data retention
> policy. Setting either > 0 enables the feature. No separate boolean is
> needed, which simplifies the configuration compared to the original
> proposal.
>
> We are considering updating the KIP to reflect this approach.
> Interested to hear your thoughts.
>
> Thanks,
> Tom
>
> On Mon, May 11, 2026 at 2:07 PM Thomas Thornton
> <[email protected]> wrote:
> >
> > Hi Manan,
> >
> > Thanks for the detailed discussion on this.
> >
> > Yes, this is a temporary and expected trade-off.
> >
> > The skew self-resolves: once the new replicas close the offset
> > threshold gap, the stable sort treats them as equivalent and the
> > original assignment order is restored. The auto leader rebalancer
> > (runs every 5 minutes by default) then moves leadership back to the
> > rack-balanced preferred replica automatically. Convergence occurs once
> > the threshold is met (default 10k messages). In the case of high topic
> > throughput, the gap between existing and new replicas can be very
> > large (millions of messages), and convergence time may approach
> > local.retention.ms.
> >
> > One alternative design is that the leader preference is determined
> > based on a time threshold of local data. For example, it could be
> > configured to 10 minutes. This means that replicas whose local data
> > spans more than 10 minutes are preferred. The benefit here is that it
> > is decoupled from Kafka topic throughput and allows operators to
> > easily set the maximum time a thin local data broker is de-prioritized
> > (and thus the max time some rack imbalance may be present). This
> > behaves well for consumers/followers that mostly keep up with
> > real-time, and thus would not trigger any unnecessary remote fetches.
> > We would consider updating the KIP to this approach if we agree it
> > provides better utility. Similar to other properties (e.g.,
> > retention.ms / retention.bytes), we'd propose this config as a
> > time/bytes pair. Interested to hear your thoughts on this.
> >
> > If operators prefer maintaining rack balance over avoiding remote
> > fetches, they simply don't enable the feature for the cluster, or
> > disable it per-topic via the topic-level override.
> >
> > Thanks,
> > Tom
> >
> > On Wed, Apr 29, 2026 at 8:48 AM Thomas Thornton
> > <[email protected]> wrote:
> > >
> > > Hi Manan,
> > >
> > > Thanks for the discussion.
> > >
> > > MG1: Good point. I've updated the KIP to add topic-level overrides for
> > > both configs so operators can tune the threshold per topic if the
> > > cluster-wide default doesn't fit a particular workload.
> > >
> > > MG2: This shouldn't cause skew. The localLogStartOffset sorting only
> > > pushes down newly-bootstrapped replicas that have significantly less
> > > local data. Existing replicas with comparable local data will
> > > be within the threshold and fall back to the original assignment
> > > order, the same behavior as today. We're not introducing a new dimension
> > > that would systematically favor certain racks or brokers.
> > >
> > > MG3: When tiered storage is not enabled for a topic, we will not send
> > > AlterPartition requests to report localLogStartOffset. There will be
> > > no extra control-plane overhead for clusters or topics that don't use
> > > tiered storage. When enabled, the additional AlterPartition calls only
> > > fire when an ISR member's localLogStartOffset actually changes, which
> > > reuses the existing protocol and should be infrequent.
> > >
> > > Thanks,
> > > Tom
> > >
> > > On Wed, Apr 29, 2026 at 5:42 PM Thomas Thornton
> > > <[email protected]> wrote:
> > > >
> > > > Hi Ivan,
> > > >
> > > > Thanks for the feedback.
> > > >
> > > > IY1: Yes, the sort is stable. Replicas within the threshold are
> > > > considered equivalent and retain their original assignment order.
> > > > We're only reordering replicas that have significantly less local
> > > > data; the existing replicas keep the same relative ordering as before.
> > > > Updated that part of the KIP to reflect this.
> > > >
> > > > IY2: Good idea. I've updated the KIP to add topic-level overrides for
> > > > both configs. This follows the standard Kafka pattern (like topic-level
> > > > `retention.ms` overriding broker-level `log.retention.ms`). The
> > > > cluster-wide default applies unless overridden per topic.
> > > >
> > > > Thanks,
> > > > Tom
> > > >
> > > > On Fri, Apr 24, 2026 at 3:18 PM Ivan Yurchenko <[email protected]> wrote:
> > > > >
> > > > > Hi Thomas,
> > > > >
> > > > > Thank you for the KIP. The motivation makes sense to me. I have a 
> > > > > couple of comments:
> > > > >
> > > > > IY1:
> > > > > > When `leader.election.prefer.early.local.log.start.offset` is
> > > > > > enabled, the key change is to sort targetReplicas by
> > > > > > local-log-start-offset (ascending) before selecting a leader. This 
> > > > > > ensures replicas with more local data (lower 
> > > > > > local-log-start-offset) are considered first in both election paths.
> > > > >
> > > > > I assume this is meant to say "sort stably", to preserve the original
> > > > > preference order as much as possible?
> > > > >
> > > > > IY2:
> > > > > Can we find a reason for a particular topic to not follow the new
> > > > > leader election algorithm, or is it strictly better and once enabled
> > > > > it's not expected to be disabled? If the answer is yes, would you
> > > > > consider adding the topic-level versions of the new configs 
> > > > > leader.election.prefer.early.local.log.start.offset and 
> > > > > leader.election.local.log.start.offset.threshold?
> > > > >
> > > > > Best,
> > > > > Ivan
> > > > >
> > > > >
> > > > > On Mon, Mar 30, 2026, at 20:43, Thomas Thornton via dev wrote:
> > > > >
> > > > > Hi all,
> > > > >
> > > > > We want to start a discussion thread for KIP-1303: Deprioritize Tiered
> > > > > Storage Followers In Leader Election.
> > > > >
> > > > > The adopted KIP-1023 introduced an optimization allowing followers to
> > > > > skip replicating data already in remote storage, dramatically reducing
> > > > > ISR join time. However, as noted in KIP-1023, this creates a risk: if
> > > > > such a follower becomes leader, it may need to serve consumer requests
> > > > > from remote storage, impacting performance.
> > > > >
> > > > > This KIP proposes to mitigate this risk by preferring replicas with
> > > > > more local data (lower localLogStartOffset) during leader election.
> > > > > Key changes include:
> > > > > 1) New config leader.election.prefer.early.local.log.start.offset to
> > > > > enable the feature
> > > > > 2) New config leader.election.local.log.start.offset.threshold to
> > > > > avoid leader churn from minor retention timing differences
> > > > > 3) Extending FetchRequest and AlterPartition to propagate
> > > > > localLogStartOffset from followers → leader → controller
> > > > >
> > > > > The full KIP is available here:
> > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1303%3A+Deprioritize+Tiered+Storage+Followers+In+Leader+Election
> > > > >
> > > > > Thanks,
> > > > > Tom
> > > > >
> > > > >
