Hi Manan,

Thanks for the feedback. We've updated the KIP to reflect the time/bytes-based approach. Please let us know your thoughts on the new proposal. Thank you again for the discussion on this.

https://cwiki.apache.org/confluence/display/KAFKA/KIP-1303%3A+Deprioritize+Tiered+Storage+Followers+In+Leader+Election

Best,
Tom

On Wed, May 13, 2026 at 12:18 PM Thomas Thornton <[email protected]> wrote:
>
> Hi Manan,
>
> Thanks for the feedback & discussion on this.
>
> We are thinking a time/bytes-based approach will work best here. The
> issue with the offset-based approach is that it doesn't scale well
> across different topic throughput rates. For high-throughput topics,
> the offset gap between existing and newly-bootstrapped replicas can be
> millions of messages, and convergence time approaches
> local.retention.ms (potentially hours), since the gap only closes as
> the old replica's local retention truncates segments until it falls
> within the offset range acceptable for leader election relative to
> the newer replica. This means hours of rack imbalance
> even when all consumers are caught up, which is a worse operational
> state than the brief remote fetches this feature tries to prevent.
>
> A time/bytes threshold directly bounds this: operators set a value
> (e.g., 10 minutes, 100MB), and once a replica's local data spans that
> threshold (duration or size), it becomes eligible for leadership. This
> is intuitive, throughput-independent, and most broadly applicable
> across varied workloads.
>
> On operator confusion with local.retention.ms: we'd name these
> distinctly under a leader election namespace, e.g.:
>
> - leader.election.eligible.local.log.ms (default: -1, disabled)
> - leader.election.eligible.local.log.bytes (default: -1, disabled)
>
> These clearly describe leader election eligibility, not data retention
> policy. Setting either to a value greater than 0 enables the feature.
> No separate boolean is needed, which simplifies the configuration
> compared to the original proposal.
>
> We are considering updating the KIP to reflect this approach.
> Interested to hear your thoughts.
>
> Thanks,
> Tom
>
> On Mon, May 11, 2026 at 2:07 PM Thomas Thornton
> <[email protected]> wrote:
> >
> > Hi Manan,
> >
> > Thanks for the detailed discussion on this.
> >
> > Yes, this is a temporary and expected trade-off.
> >
> > The skew self-resolves: once the new replicas close the offset
> > threshold gap, the stable sort treats them as equivalent and the
> > original assignment order is restored. The auto leader rebalancer
> > (runs every 5 minutes by default) then moves leadership back to the
> > rack-balanced preferred replica automatically. Convergence occurs once
> > the threshold is met (default 10k messages). In the case of high topic
> > throughput, the gap between existing and new replicas can be very
> > large (millions of messages), and convergence time may approach
> > local.retention.ms.
> >
> > One alternative design is that the leader preference is determined
> > based on a time threshold of local data. For example, it could be
> > configured to 10 minutes. This means that replicas whose local data
> > spans more than 10 minutes are preferred. The benefit here is that it
> > is decoupled from Kafka topic throughput rate and allows operators to
> > easily set the maximum time a broker with thin local data is
> > de-prioritized (and thus the max time some rack imbalance may be
> > present). This behaves well for consumers/followers that mostly keep
> > up with real-time, and thus would not trigger any unnecessary remote
> > fetches. We would consider updating the KIP to this approach if we
> > agree it provides better utility. Similar to other properties (e.g.,
> > retention.ms / retention.bytes), we'd propose this config as a
> > time/bytes pair. Interested to hear your thoughts on this.
> >
> > If operators prefer maintaining rack balance over avoiding remote
> > fetches, they simply don't enable the feature for the cluster, or
> > disable it per-topic via the topic-level override.
> >
> > Thanks,
> > Tom
> >
> > On Wed, Apr 29, 2026 at 8:48 AM Thomas Thornton
> > <[email protected]> wrote:
> > >
> > > Hi Manan,
> > >
> > > Thanks for the discussion.
> > >
> > > MG1: Good point. I've updated the KIP to add topic-level overrides for
> > > both configs so operators can tune the threshold per topic if the
> > > cluster-wide default doesn't fit a particular workload.
> > >
> > > MG2: This shouldn't cause skew. The localLogStartOffset sorting only
> > > pushes down newly-bootstrapped replicas that have significantly less
> > > local data. Existing replicas with comparable local data will
> > > be within the threshold and fall back to the original assignment
> > > order, same behavior as today. We're not introducing a new dimension
> > > that would systematically favor certain racks or brokers.
> > >
> > > MG3: When tiered storage is not enabled for a topic, we will not send
> > > AlterPartition requests to report localLogStartOffset. There will be
> > > no extra control-plane overhead for clusters or topics that don't use
> > > tiered storage. When enabled, the additional AlterPartition calls only
> > > fire when an ISR member's localLogStartOffset actually changes, which
> > > reuses the existing protocol and should be infrequent.
> > >
> > > Thanks,
> > > Tom
> > >
> > > On Wed, Apr 29, 2026 at 5:42 PM Thomas Thornton
> > > <[email protected]> wrote:
> > > >
> > > > Hi Ivan,
> > > >
> > > > Thanks for the feedback.
> > > >
> > > > IY1: Yes, the sort is stable. Replicas within the threshold are
> > > > considered equivalent and retain their original assignment order.
> > > > We're only reordering replicas that have significantly less local
> > > > data; the existing replicas keep the same relative ordering as before.
> > > > Updated that part of the KIP to reflect this.
> > > >
> > > > IY2: Good idea. I've updated the KIP to add topic-level overrides for
> > > > both configs.
> > > > This follows the standard Kafka pattern (like
> > > > `retention.ms`, `log.retention.ms`). The cluster-wide default applies
> > > > unless overridden per topic.
> > > >
> > > > Thanks,
> > > > Tom
> > > >
> > > > On Fri, Apr 24, 2026 at 3:18 PM Ivan Yurchenko <[email protected]> wrote:
> > > > >
> > > > > Hi Thomas,
> > > > >
> > > > > Thank you for the KIP. The motivation makes sense to me. I have a
> > > > > couple of comments:
> > > > >
> > > > > IY1:
> > > > > > When `leader.election.prefer.early.local.log.start.offset` is
> > > > > > enabled, the key change is to sort targetReplicas by
> > > > > > local-log-start-offset (ascending) before selecting a leader. This
> > > > > > ensures replicas with more local data (lower
> > > > > > local-log-start-offset) are considered first in both election paths.
> > > > >
> > > > > I assume here it meant to say "sort stably", to preserve the original
> > > > > preference order as much as possible?
> > > > >
> > > > > IY2:
> > > > > Can we find a reason for a particular topic to not follow the new
> > > > > leader election algorithm, or is it strictly better and once enabled
> > > > > it's not expected to be disabled? If the answer is yes, would you
> > > > > consider adding the topic-level versions of the new configs
> > > > > leader.election.prefer.early.local.log.start.offset and
> > > > > leader.election.local.log.start.offset.threshold?
> > > > >
> > > > > Best,
> > > > > Ivan
> > > > >
> > > > >
> > > > > On Mon, Mar 30, 2026, at 20:43, Thomas Thornton via dev wrote:
> > > > >
> > > > > Hi all,
> > > > >
> > > > > We want to start a discussion thread for KIP-1303: Deprioritize Tiered
> > > > > Storage Followers In Leader Election.
> > > > >
> > > > > The adopted KIP-1023 introduced an optimization allowing followers to
> > > > > skip replicating data already in remote storage, dramatically reducing
> > > > > ISR join time. However, as noted in KIP-1023, this creates a risk: if
> > > > > such a follower becomes leader, it may need to serve consumer requests
> > > > > from remote storage, impacting performance.
> > > > >
> > > > > This KIP proposes to mitigate this risk by preferring replicas with
> > > > > more local data (lower localLogStartOffset) during leader election.
> > > > > Key changes include:
> > > > > 1) New config leader.election.prefer.early.local.log.start.offset to
> > > > > enable the feature
> > > > > 2) New config leader.election.local.log.start.offset.threshold to
> > > > > avoid leader churn from minor retention timing differences
> > > > > 3) Extending FetchRequest and AlterPartition to propagate
> > > > > localLogStartOffset from followers → leader → controller
> > > > >
> > > > > The full KIP is available here:
> > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1303%3A+Deprioritize+Tiered+Storage+Followers+In+Leader+Election
> > > > >
> > > > > Thanks,
> > > > > Tom
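
For readers following the thread, the offset-based ordering in the original proposal (stable sort by localLogStartOffset, with replicas inside the threshold treated as equivalent) might look roughly like the Python sketch below. It is illustrative only and not the KIP's implementation: the `Replica` type and `order_for_election` function are hypothetical names, and "within the threshold" is assumed to mean "at most `threshold` offsets above the lowest localLogStartOffset among the candidates". As a simplification, replicas outside the threshold are only pushed behind the others and are not further sorted among themselves.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    # Hypothetical type for illustration; not a Kafka class.
    broker_id: int
    local_log_start_offset: int


def order_for_election(replicas, threshold):
    """Return replicas in leader-election preference order.

    Assumption: a replica is "within the threshold" when its
    local_log_start_offset is at most `threshold` above the lowest
    value among the candidates.
    """
    if not replicas:
        return []
    lowest = min(r.local_log_start_offset for r in replicas)
    # sorted() is stable: replicas with the same key (False = within the
    # threshold, True = beyond it) keep their original assignment order.
    return sorted(
        replicas,
        key=lambda r: r.local_log_start_offset - lowest > threshold,
    )


assignment = [
    Replica(broker_id=1, local_log_start_offset=5_000_000),  # newly bootstrapped
    Replica(broker_id=2, local_log_start_offset=1_000),
    Replica(broker_id=3, local_log_start_offset=0),
]
ordered = order_for_election(assignment, threshold=10_000)
print([r.broker_id for r in ordered])  # [2, 3, 1]
```

Once the offset gap of broker 1 falls inside the threshold, every key is equal and the original assignment order (1, 2, 3) comes back, which matches the self-resolving skew described in the replies above.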
