Hi Jun,

The most important part of this story is what data users should expect to
see when using the latest or by_duration policy with expanded partitions.

Yes, the by_duration policy can minimize data loss, but it is
non-deterministic, which means users will either read too many historical
records from existing partitions or lose some records from expanded
partitions.
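
To make that trade-off concrete, here is a minimal sketch. It is not the
consumer's actual implementation: the function name by_duration_reset and
all timestamps are made up for illustration. It models by_duration as a
seek to the first record whose timestamp is at or after now - duration.

```python
# Illustrative model only: by_duration seeks to the first record whose
# timestamp is >= now - duration. All names and numbers are hypothetical.

def by_duration_reset(record_timestamps_ms, now_ms, duration_ms):
    """Return (skipped, read) record timestamps for one partition."""
    target = now_ms - duration_ms
    skipped = [ts for ts in record_timestamps_ms if ts < target]
    read = [ts for ts in record_timestamps_ms if ts >= target]
    return skipped, read


now = 100_000

# Existing partition with old data: a generous duration replays history.
existing = [10_000, 40_000, 70_000, 99_000]
skipped_old, read_old = by_duration_reset(existing, now, duration_ms=60_000)

# Expanded partition created at t=80_000 but discovered only at t=100_000:
# a tight duration skips records produced right after creation.
expanded = [80_000, 85_000, 97_000]
skipped_new, read_new = by_duration_reset(expanded, now, duration_ms=5_000)

print("existing partition replays:", read_old)
print("expanded partition loses:", skipped_new)
```

In this toy model no single duration suits both partitions: a value large
enough to cover the expanded partition's discovery delay also replays
history on the existing one, and a small value drops the early records of
the expanded one.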

Also, I agree that auto.offset.reset.max.age.ms is a bit hard to
understand, and that is why I preferred having a whole new policy based
entirely on group creation time (KIP-1282).
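
As a rough sketch of how the two classifiers discussed in this thread
differ, consider the idle-consumer timeline from Jiunn-Yang's reply quoted
below (group created at T=0, partition added at T=50, consumer resumes at
T=100). The helper names and the threshold of 10 are assumptions made for
illustration only; neither KIP defines these functions.

```python
# Hypothetical helpers contrasting the two classification signals.
# Times are abstract ticks, matching the T=0 / T=50 / T=100 example.

def classify_by_group_creation(partition_created, group_created, base_policy):
    """Group-creation rule: partitions created after the group are new."""
    return "earliest" if partition_created > group_created else base_policy


def classify_by_partition_age(partition_created, now, max_age, base_policy):
    """Partition-age rule: partitions younger than max_age are new."""
    age = now - partition_created
    return "earliest" if age < max_age else base_policy


group_created, partition_created, now = 0, 50, 100

by_group = classify_by_group_creation(partition_created, group_created, "latest")
by_age = classify_by_partition_age(partition_created, now, 10, "latest")

print("group creation time ->", by_group)
print("partition age       ->", by_age)
```

With these inputs the group-creation rule resets the new partition to
earliest, while the partition-age rule keeps latest because the partition is
already older than the threshold when the consumer resumes.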

Best,
Chia-Ping

Jun Rao via dev <[email protected]> wrote on Tue, Jun 16, 2026 at 1:08 AM:

> Hi, Chia-Ping and Jiunn-Yang,
>
> Thanks for the reply. I am still trying to understand the value of the new
> configs with the KIP.
>
> The motivation of the KIP is that a user doesn't want to miss the data if
> the backlog is small. The backlog of the existing partition is easy to
> understand because it relates to retention time. The backlog for the new
> partition is a bit subtle to understand since it depends on the metadata
> refresh delay. To set auto.offset.reset.max.age.ms, the user needs to
> understand the metadata refresh delay on the consumer side and use it to
> set the config.
>
> Now, let's consider the alternative: setting the same value for the
> existing by_duration policy. The KIP lists three issues with this approach.
> 1. It computes the seek target client-side as now() - duration, which
> introduces clock skew across consumers and forces operators to choose
> overly large durations, causing unnecessary reprocessing.
> 2. The target timestamp is recomputed on each retry, so failed
> ListOffsetsRequest retries can shift the target forward and potentially
> miss records produced between attempts.
> 3. It applies uniformly to all partitions without committed offsets, and
> cannot distinguish newly expanded partitions from long-existing partitions
> newly assigned to the group, leading to unnecessary replay.
>
> Issues 1 and 2 are uncommon and can be mitigated by adding a bit of buffer
> to the metadata refresh delay. We could also consider improving the
> implementation. For issue 3, the metadata refresh delay is typically low
> (on the order of minutes with the classic consumer and tens of seconds with
> the new consumer). If a user is ok with reading that much backlog for new
> partitions, it seems they will be ok doing the same for existing
> partitions.
>
> So, instead of introducing a new config, could we just reuse the existing
> config with better documentation and/or implementation?
>
> Jun
>
>
> On Sat, Jun 13, 2026 at 12:19 AM 黃竣陽 <[email protected]> wrote:
>
> > Hello Jun,
> >
> > You're right that group creation time is the more intuitive answer at
> > first glance. The KIP's own motivation talks about partitions that
> > "predate the group" vs partitions "created during group runtime," which
> > directly points to a group-lifecycle classifier. I'd like to walk through
> > why we landed on partition age, and the trade-offs we considered.
> >
> > We evaluated three candidate signals:
> >
> > 1. `by_duration:5secs`
> >
> > This covers the metadata blindness window, but has issues the KIP
> > currently documents under "Why not use `by_duration`?":
> >
> > - Client-side `now() - duration` introduces clock skew across consumers.
> > - `ListOffsets` retries shift the target forward, potentially missing
> >   records produced between attempts.
> > - It applies uniformly to all partitions without committed offsets,
> >   including pre-existing partitions newly assigned to the group, causing
> >   unnecessary replay.
> >
> > 2. Group creation time as classifier
> >
> > This works cleanly when the consumer is actively running. Our concern
> > is the idle / late-rejoin case:
> >
> > T=0:         Group created.
> > T=1..T=100:  Consumer idle (down, disconnected, etc.).
> > T=50:        Partition added during the idle window.
> > T=100:       Consumer resumes.
> >
> > Under group creation time, the new partition is classified as new
> > (`50 > 0`) and reset to `earliest`, replaying everything from T=50.
> > But during `[T=1, T=100]`, base partitions also accumulated data that
> > the consumer accepts as lost — that is precisely the contract of
> > `auto.offset.reset=latest`. There is no principled reason to treat
> > the new partition differently; both contain backlog accumulated during
> > the same idle window.
> >
> > This aligns with the "backlog is backlog" principle you raised in
> > the KIP-1282 thread: a `latest` user has tolerated some backlog on
> > every other partition during the same idle period; forcing 0-backlog
> > tolerance only on new partitions would be inconsistent with that
> > tolerance.
> >
> > 3. Partition age vs threshold
> >
> > Partition age corresponds to the actual silent data loss window,
> > the gap between partition creation and the consumer’s metadata
> > refresh. Within this window, data loss is genuinely silent: the
> > consumer had no opportunity to know about the partition. Outside this
> > window, missing data reflects either:
> >
> > - (a) the user’s tolerated cost of running with idle consumers, or
> > - (b) an operational issue to surface via monitoring, not via reset
> >   policy.
> >
> > We did not choose partition age because it is more elegant than group
> > creation time; we chose it because its failure mode (requires a
> > threshold) is less invasive than the failure mode of group creation
> > time (overrides user-stated `latest` intent during idle periods).
> >
> > Best Regards,
> > Jiunn-Yang
> >
> > > Chia-Ping Tsai <[email protected]> wrote on Jun 13, 2026 at 11:52 AM:
> > >
> > > Hi Jun,
> > >
> > > Relying on both creation times will create an inconsistent scenario. A
> > > consumer that lost all offsets due to a long sleep will seek to the
> > > beginning for the partitions created later than the group.
> > >
> > > That is why we initially proposed KIP-1282 to fix the inconsistency
> > > using a whole new policy. Since KIP-1282 couldn't reach a consensus,
> > > KIP-1327 goes back to using flexible configurations to prevent users
> > > from falling into that pitfall.
> > >
> > > Best, Chia-Ping
> > >
> > > Jun Rao via dev <[email protected]> wrote on Sat, Jun 13, 2026 at 6:49 AM:
> > >
> > >> Hi, Jiunn-Yang,
> > >>
> > >> Thanks for the reply and sorry for the late reply.
> > >>
> > >> JR1. The design of auto.offset.reset.max.age.ms still feels weird to
> > >> me. It categorizes partitions as new or existing based on the
> > >> partition creation time. Intuitively, the categorization should be
> > >> based on the group creation time: all partitions existing when the
> > >> group is created are existing and all partitions created after the
> > >> group creation are new partitions.
> > >>
> > >> Jun
> > >>
> > >>
> > >>
> > >> On Tue, Jun 9, 2026 at 8:51 AM 黃竣陽 <[email protected]> wrote:
> > >>
> > >>> Hi all,
> > >>>
> > >>> Manually bumping this thread. If there is no further
> > >>> discussion, I will close the vote.
> > >>>
> > >>> Best Regards,
> > >>> Jiunn-Yang
> > >>>
> > >>>> 黃竣陽 <[email protected]> wrote on Jun 1, 2026 at 7:16 PM:
> > >>>>
> > >>>> Hello Jian,
> > >>>>
> > >>>> Thanks for your feedback,
> > >>>>
> > >>>> Agreed, partition expansion is a common operational task, not an
> > >>>> edge case. I've updated the Motivation section accordingly.
> > >>>>
> > >>>> Best Regards,
> > >>>> Jiunn-Yang
> > >>>>
> > >>>>> jian fu <[email protected]> wrote on Jun 1, 2026 at 5:49 PM:
> > >>>>>
> > >>>>> Hi Jiunn-Yang:
> > >>>>>
> > >>>>> Thanks for the KIP. I think it would be useful to clarify that this
> > >>>>> is a common scenario rather than an edge case, which further
> > >>>>> demonstrates the need for this optimization. For example:
> > >>>>> Partition expansion is a common operational task in Kafka. To
> > >>>>> balance resource utilization and cost, topics are typically created
> > >>>>> with a moderate default partition count. However, as traffic grows
> > >>>>> over time, it is often necessary to increase the number of
> > >>>>> partitions to accommodate the higher workload.
> > >>>>>
> > >>>>> Regards
> > >>>>> Jian
> > >>>>>
> > >>>>> 黃竣陽 <[email protected]> wrote on Sat, May 30, 2026 at 10:31 PM:
> > >>>>>
> > >>>>>> Hello chia,
> > >>>>>>
> > >>>>>> Thanks for the comments, I have updated the KIP!
> > >>>>>>
> > >>>>>> Best Regards,
> > >>>>>> Jiunn-Yang
> > >>>>>>
> > >>>>>>> Chia-Ping Tsai <[email protected]> wrote on May 30, 2026 at 8:29 PM:
> > >>>>>>>
> > >>>>>>> Hi Jiunn-Yang,
> > >>>>>>>
> > >>>>>>> Would you mind removing the terms "hot" and "cold" when describing
> > >>>>>>> partitions in the KIP? I understand you are using them to describe
> > >>>>>>> the "freshness" or the users' need for the records, but applying
> > >>>>>>> these terms to the partition itself feels a bit unnatural.
> > >>>>>>>
> > >>>>>>> After all, in this scenario, users don't really care whether a
> > >>>>>>> partition is newly expanded or not. Their only expectation is that
> > >>>>>>> they won't silently lose any live records produced to the topic
> > >>>>>>> during their active consumption.
> > >>>>>>>
> > >>>>>>> Best, Chia-Ping
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> 黃竣陽 <[email protected]> wrote on Sat, May 30, 2026 at 12:30 PM:
> > >>>>>>>
> > >>>>>>>> Hello Jun,
> > >>>>>>>>
> > >>>>>>>> Thanks for the feedback, I have updated the KIP motivation section.
> > >>>>>>>>
> > >>>>>>>> Best Regards,
> > >>>>>>>> Jiunn-Yang
> > >>>>>>>>
> > >>>>>>>>> Jun Rao via dev <[email protected]> wrote on May 30, 2026 at 1:12 AM:
> > >>>>>>>>>
> > >>>>>>>>> Hi, Jiunn-Yang,
> > >>>>>>>>>
> > >>>>>>>>> Thanks for the reply. I think we need a stronger motivation for
> > >>>>>>>>> the KIP.
> > >>>>>>>>>
> > >>>>>>>>> The KIP says "The core insight is that not all partitions without
> > >>>>>>>>> a committed offset are the same. A newly expanded partition (hot)
> > >>>>>>>>> is fundamentally different from a partition the consumer has
> > >>>>>>>>> never seen because it predates the group (cold)." Why is the hot
> > >>>>>>>>> partition fundamentally different from the cold?
> > >>>>>>>>>
> > >>>>>>>>> The KIP says "The existing by_duration policy is also
> > >>>>>>>>> insufficient because:
> > >>>>>>>>>
> > >>>>>>>>> - The calculated seek time (now() - duration) varies across nodes
> > >>>>>>>>>   due to clock skew. To be safe, users must set an overly large
> > >>>>>>>>>   duration, causing unnecessary reprocessing.
> > >>>>>>>>> - On network errors, the client recalculates the seek time on
> > >>>>>>>>>   retry, shifting the target timestamp forward and risking data
> > >>>>>>>>>   loss."
> > >>>>>>>>>
> > >>>>>>>>> However, both of these situations are rare. If these issues
> > >>>>>>>>> persist, more severe problems likely exist elsewhere. Rare
> > >>>>>>>>> situations don't need a common solution. If users care about
> > >>>>>>>>> those rare situations, they can implement customized logic using
> > >>>>>>>>> ConsumerRebalanceListener.onPartitionsAssigned().
> > >>>>>>>>>
> > >>>>>>>>> Jun
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> On Sun, May 17, 2026 at 6:50 AM 黃竣陽 <[email protected]> wrote:
> > >>>>>>>>>
> > >>>>>>>>>> Hello chia,
> > >>>>>>>>>>
> > >>>>>>>>>> Thanks for the feedback,
> > >>>>>>>>>>
> > >>>>>>>>>>> If the creation time exists, the returned value should always
> > >>>>>>>>>>> be greater than or equal to zero, right?
> > >>>>>>>>>> I have explicitly mentioned this in the KIP.
> > >>>>>>>>>>
> > >>>>>>>>>>>> New  Old (MetadataResponse v0–13)    positive        any field absent    UnsupportedVersionException
> > >>>>>>>>>>
> > >>>>>>>>>> The earliest point at which we can detect the version mismatch
> > >>>>>>>>>> is during the first metadata fetch after assignment, which
> > >>>>>>>>>> occurs inside poll(). Therefore, the user would encounter an
> > >>>>>>>>>> UnsupportedVersionException from poll(). I’ll clarify this in
> > >>>>>>>>>> the KIP.
> > >>>>>>>>>>
> > >>>>>>>>>> Best Regards,
> > >>>>>>>>>> Jiunn-Yang
> > >>>>>>>>>>
> > >>>>>>>>>>> Chia-Ping Tsai <[email protected]> wrote on May 17, 2026 at 4:50 PM:
> > >>>>>>>>>>>
> > >>>>>>>>>>> hi Jiunn
> > >>>>>>>>>>>
> > >>>>>>>>>>>> PartitionAgeMs (int64, default -1): The age of this partition
> > >>>>>>>>>>>> in milliseconds, computed server-side by the broker as
> > >>>>>>>>>>>> broker_current_time - partition_creation_time. Returns -1 if
> > >>>>>>>>>>>> the broker does not support this feature or the partition
> > >>>>>>>>>>>> creation time is unknown.
> > >>>>>>>>>>>
> > >>>>>>>>>>> If the creation time exists, the returned value should always
> > >>>>>>>>>>> be greater than or equal to zero, right?
> > >>>>>>>>>>>
> > >>>>>>>>>>>> New  Old (MetadataResponse v0–13)    positive        any field absent    UnsupportedVersionException
> > >>>>>>>>>>>
> > >>>>>>>>>>> Will users encounter UnsupportedVersionException when calling
> > >>>>>>>>>>> `poll()`?
> > >>>>>>>>>>>
> > >>>>>>>>>>> Best,
> > >>>>>>>>>>> Chia-Ping
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> On 2026/05/16 04:30:49 黃竣陽 wrote:
> > >>>>>>>>>>>> Hello Jun, chia,
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I've updated KIP-1327 with a design change based on the
> > >>>>>>>>>>>> discussion feedback.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> The updated design decouples the new-partition reset behavior
> > >>>>>>>>>>>> from the base auto.offset.reset policy:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> - auto.offset.reset.max.age.ms now applies to all
> > >>>>>>>>>>>>   auto.offset.reset values (latest, earliest, by_duration,
> > >>>>>>>>>>>>   none).
> > >>>>>>>>>>>> - For new ("hot") partitions, the consumer resets to the
> > >>>>>>>>>>>>   auto.offset.reset.new.partitions config setting.
> > >>>>>>>>>>>> - For existing ("cold") partitions, the base auto.offset.reset
> > >>>>>>>>>>>>   policy continues to apply unchanged.
> > >>>>>>>>>>>> - The new-partition reset behavior is represented by a
> > >>>>>>>>>>>>   separate internal config (auto.offset.reset.new.partitions,
> > >>>>>>>>>>>>   currently fixed to earliest). This decoupled design makes it
> > >>>>>>>>>>>>   straightforward to promote the behavior to a public
> > >>>>>>>>>>>>   user-facing configuration in a future KIP.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Best Regards,
> > >>>>>>>>>>>> Jiunn-Yang
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> Chia-Ping Tsai <[email protected]> wrote on May 16, 2026 at 7:46 AM:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> hi Jun
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> I see what you mean now. My proposal is listed below:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> 1) Add auto.offset.reset.new.partitions with a default value
> > >>>>>>>>>>>>> of earliest. It fixes the data loss from both by_duration and
> > >>>>>>>>>>>>> latest, and it does not change the logic of
> > >>>>>>>>>>>>> auto.offset.reset=earliest.
> > >>>>>>>>>>>>> 2) Mark auto.offset.reset.new.partitions as an internal
> > >>>>>>>>>>>>> configuration. auto.offset.reset.new.partitions=earliest
> > >>>>>>>>>>>>> already addresses the issue, and we can discuss the use cases
> > >>>>>>>>>>>>> of other values in a separate KIP.
> > >>>>>>>>>>>>> 3) Both configs, auto.offset.reset.new.partitions and
> > >>>>>>>>>>>>> auto.offset.reset.latest.max.age.ms, will be applied to all
> > >>>>>>>>>>>>> reset policies for consistency.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> WDYT?
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> On 2026/05/15 20:53:20 Jun Rao via dev wrote:
> > >>>>>>>>>>>>>> Hi, Chia-Ping,
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Thanks for the reply.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> 1. In the motivation section, the KIP says "When a Kafka
> > >>>>>>>>>>>>>> topic is expanded with new partitions, consumers using the
> > >>>>>>>>>>>>>> latest auto offset reset policy will silently miss all
> > >>>>>>>>>>>>>> records produced to those partitions before the consumer
> > >>>>>>>>>>>>>> discovers them." If a user sets
> > >>>>>>>>>>>>>> auto.offset.reset=by_duration=1sec, the same record loss
> > >>>>>>>>>>>>>> issue could also happen, right?
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> 2. I was thinking auto.offset.reset.new.partitions will take
> > >>>>>>>>>>>>>> the same values as auto.offset.reset. So a user could set it
> > >>>>>>>>>>>>>> to by_duration if needed.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Jun
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> On Thu, May 14, 2026 at 4:06 PM Chia-Ping Tsai <[email protected]> wrote:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> hi Jun
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Thanks for the feedback. I might be missing something
> > >>>>>>>>>>>>>>> important from your suggestion, so please bear with me as I
> > >>>>>>>>>>>>>>> try to clarify with a few questions:
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> 1. Is there a strong use case for extending this logic to
> > >>>>>>>>>>>>>>> other reset policies? Unlike latest, policies like earliest
> > >>>>>>>>>>>>>>> or by_duration don't seem to suffer from the same silent
> > >>>>>>>>>>>>>>> data loss issue when a partition is expanded.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> 2. What values would we expect users to configure for
> > >>>>>>>>>>>>>>> auto.offset.reset.new.partitions? If they set it to
> > >>>>>>>>>>>>>>> earliest or latest, we might run into the exact same edge
> > >>>>>>>>>>>>>>> cases. For example, if a consumer is offline for a while
> > >>>>>>>>>>>>>>> and a new partition is created during that downtime, the
> > >>>>>>>>>>>>>>> user might actually want to skip to latest when resuming,
> > >>>>>>>>>>>>>>> rather than reading from earliest just because the
> > >>>>>>>>>>>>>>> partition is technically "new" to the group.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> This is exactly why we opted for introducing a max.age
> > >>>>>>>>>>>>>>> threshold. It gives users a time-bound way to define what
> > >>>>>>>>>>>>>>> is genuinely "hot/new" and what is just an old partition
> > >>>>>>>>>>>>>>> they haven't seen yet.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Best,
> > >>>>>>>>>>>>>>> Chia-Ping
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> On 2026/05/14 20:48:09 Jun Rao via dev wrote:
> > >>>>>>>>>>>>>>>> Hi, Jiunn-Yang,
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Thanks for the KIP.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> I find auto.offset.reset.latest.max.age a bit weird. It
> > >>>>>>>>>>>>>>>> only applies when auto.offset.reset is latest. However, it
> > >>>>>>>>>>>>>>>> seems that the motivation equally applies when
> > >>>>>>>>>>>>>>>> auto.offset.reset is set to other values like by_duration.
> > >>>>>>>>>>>>>>>> The intention is that we want to have a separate way to
> > >>>>>>>>>>>>>>>> control newly created partitions vs existing partitions
> > >>>>>>>>>>>>>>>> when the group starts. Have we considered adding a new
> > >>>>>>>>>>>>>>>> config like auto.offset.reset.new.partitions? If this new
> > >>>>>>>>>>>>>>>> config is not set, the offset reset policy defaults to the
> > >>>>>>>>>>>>>>>> policy used for existing partitions. The user could set it
> > >>>>>>>>>>>>>>>> explicitly to customize the behavior for new partitions.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Jun
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> On Thu, May 7, 2026 at 5:07 AM 黃竣陽 <[email protected]> wrote:
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Hi all,
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> I’d like to manually bump this thread.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Best Regards,
> > >>>>>>>>>>>>>>>>> Jiunn-Yang
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> 黃竣陽 <[email protected]> wrote on May 1, 2026 at 10:37 PM:
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Hello all,
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Thanks for the feedback.
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> DJ01/DJ02:
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> MetadataResponse bumps from v13 to v14. The
> > >>>>>>>>>>>>>>>>>> PartitionMetadata struct gains a new field
> > >>>>>>>>>>>>>>>>>> PartitionAgeMs (int64, default -1), computed server-side
> > >>>>>>>>>>>>>>>>>> by the broker as
> > >>>>>>>>>>>>>>>>>> broker_current_time - partition_creation_time.
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> I also added the consumer heartbeat flow. When
> > >>>>>>>>>>>>>>>>>> MembershipManager detects a newly assigned partition, it
> > >>>>>>>>>>>>>>>>>> explicitly invalidates the metadata for the affected
> > >>>>>>>>>>>>>>>>>> topic and forces a fresh MetadataRequest before making
> > >>>>>>>>>>>>>>>>>> the offset reset decision, even if the topic ID is
> > >>>>>>>>>>>>>>>>>> already in the cache.
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> MB0:
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> The consumer learns the broker's maximum supported
> > >>>>>>>>>>>>>>>>>> MetadataResponse version via the ApiVersions negotiation
> > >>>>>>>>>>>>>>>>>> at connection time. If the negotiated version is
> > >>>>>>>>>>>>>>>>>> unsupported, the consumer knows the broker does not
> > >>>>>>>>>>>>>>>>>> support PartitionAgeMs at all and can throw an
> > >>>>>>>>>>>>>>>>>> UnsupportedVersionException immediately, rather than
> > >>>>>>>>>>>>>>>>>> silently falling back to latest and risking data loss
> > >>>>>>>>>>>>>>>>>> without any operator-visible signal.
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> MB1/MB2/MB3:
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> I have addressed these changes in the KIP.
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Best Regards,
> > >>>>>>>>>>>>>>>>>> Jiunn-Yang
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Chia-Ping Tsai <[email protected]> wrote on Apr 29, 2026 at 4:04 PM:
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> hi David
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> I agree with the direction of moving the 'age'
> > >>>>>>>>>>>>>>>>>>> resolution from the Heartbeat API to the Metadata API
> > >>>>>>>>>>>>>>>>>>> to keep the control plane clean. The main trade-off, as
> > >>>>>>>>>>>>>>>>>>> we noted before, is introducing inter-broker clock
> > >>>>>>>>>>>>>>>>>>> skew. The Group Coordinator approach provided a single
> > >>>>>>>>>>>>>>>>>>> source of truth for time.
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> However, realistically, this time skew should be
> > >>>>>>>>>>>>>>>>>>> negligible. Given that the max.age threshold will
> > >>>>>>>>>>>>>>>>>>> likely be configured in minutes or hours, a typical NTP
> > >>>>>>>>>>>>>>>>>>> skew (in milliseconds) between brokers won't impact the
> > >>>>>>>>>>>>>>>>>>> fallback decision.
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Best,
> > >>>>>>>>>>>>>>>>>>> Chia-Ping
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> David Jacot via dev <[email protected]> wrote on Apr 29, 2026 at 3:29 PM:
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> Hi all,
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> Thanks for the KIP!
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> Sorry, I haven't really followed the previous
> > >>>>>>>>>>>>>>>>>>>> conversation but I took a quick look at this one.
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> DJ01: I don't clearly understand the flow with the
> > >>>>>>>>>>>>>>>>>>>> ConsumerGroupHeartbeat API after reading the KIP.
> > >>>>>>>>>>>>>>>>>>>> There is a new boolean; the KIP states that partition
> > >>>>>>>>>>>>>>>>>>>> ages are returned only when this boolean is set.
> > >>>>>>>>>>>>>>>>>>>> Implicitly, this means that when the consumer receives
> > >>>>>>>>>>>>>>>>>>>> a new partition, it will issue a new HB request with
> > >>>>>>>>>>>>>>>>>>>> the boolean set to receive the ages. Is my
> > >>>>>>>>>>>>>>>>>>>> understanding correct? We should perhaps clarify the
> > >>>>>>>>>>>>>>>>>>>> flow and also explain how it fits into the existing
> > >>>>>>>>>>>>>>>>>>>> flow (e.g. list offsets, fetch offsets, etc.).
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> DJ02: If my understanding is correct, I wonder if the
> > >>>>>>>>>>>>>>>>>>>> ConsumerGroupHeartbeat API is the right place for this
> > >>>>>>>>>>>>>>>>>>>> given that a new round trip is done anyway.
> > >>>>>>>>>>>>>>>>>>>> Alternatively, it could simply include the metadata.
> > >>>>>>>>>>>>>>>>>>>> Generally, we should be rather cautious about
> > >>>>>>>>>>>>>>>>>>>> overloading the ConsumerGroupHeartbeat API with
> > >>>>>>>>>>>>>>>>>>>> unrelated concepts. The API is a control plane API for
> > >>>>>>>>>>>>>>>>>>>> assigning or revoking partitions. The fact that we
> > >>>>>>>>>>>>>>>>>>>> don't want to add it to the corresponding Streams API
> > >>>>>>>>>>>>>>>>>>>> also suggests something is not quite right. What would
> > >>>>>>>>>>>>>>>>>>>> we do if we want to support Streams in the future?
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> Best,
> > >>>>>>>>>>>>>>>>>>>> David
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> On Wed, Apr 29, 2026 at 12:28 AM Muralidhar Basani via dev
> > >>>>>>>>>>>>>>>>>>>>> <[email protected]> wrote:
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> Hi Jiunn,
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> Thank you for this great KIP. Good to know about the
> > >>>>>>>>>>>>>>>>>>>>> gap.
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> mb-0 - Why a new v2 version bump for the
> > >>>>>>>>>>>>>>>>>>>>> RequestPartitionAges field? Can a tagged field (for
> > >>>>>>>>>>>>>>>>>>>>> example, on the response, PartitionAges on
> > >>>>>>>>>>>>>>>>>>>>> TopicPartitions) be used here to avoid the version
> > >>>>>>>>>>>>>>>>>>>>> bump?
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> mb-1 - For the new config, is there a recommended
> > >>>>>>>>>>>>>>>>>>>>> value or a ConfigDef validator? Probably it should be
> > >>>>>>>>>>>>>>>>>>>>> based on metadata.max.age.ms? Sizing instructions can
> > >>>>>>>>>>>>>>>>>>>>> be part of the javadocs I guess.
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> mb-2 - (minor) As there are no changes to Kafka
> > >>>>>>>>>>>>>>>>>>>>> Streams, would it be better to add this new config
> > >>>>>>>>>>>>>>>>>>>>> auto.offset.reset.latest.max.age to the StreamsConfig
> > >>>>>>>>>>>>>>>>>>>>> block list (NON_CONFIGURABLE_CONSUMER_DEFAULT_CONFIGS)
> > >>>>>>>>>>>>>>>>>>>>> for a clear warning, in case users configure it? This
> > >>>>>>>>>>>>>>>>>>>>> is the most familiar consumer config and users might
> > >>>>>>>>>>>>>>>>>>>>> easily mistakenly configure it. Or maybe it's not
> > >>>>>>>>>>>>>>>>>>>>> worth adding.
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> mb-3 - (minor) The phrasing "the consumer falls back
> > >>>>>>>>>>>>>>>>>>>>> to earliest" reads as if the config were being changed
> > >>>>>>>>>>>>>>>>>>>>> per-partition, which isn't supported. Maybe rephrase
> > >>>>>>>>>>>>>>>>>>>>> to something like "consumer resolves the initial
> > >>>>>>>>>>>>>>>>>>>>> position to start offset for that partition", as if
> > >>>>>>>>>>>>>>>>>>>>> earliest was applied to that partition only and the
> > >>>>>>>>>>>>>>>>>>>>> auto.offset.reset config is unchanged.
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> Thanks,
> > >>>>>>>>>>>>>>>>>>>>> Murali
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> On Tue, Apr 28, 2026 at 2:48 PM 黃竣陽 <[email protected]> wrote:
> > >>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> Hi chia,
> > >>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> I have updated the KIP to include this change.
> > >>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> Best Regards,
> > >>>>>>>>>>>>>>>>>>>>>> Jiunn-Yang
> > >>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>> Chia-Ping Tsai <[email protected]> wrote on Apr 28, 2026 at 8:03 PM:
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>> hi Jiunn-Yang
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>> chia_0: Should we expose the partition creation time
> > >>>>>>>>>>>>>>>>>>>>>>> via the Admin API? I assume it would be valuable for
> > >>>>>>>>>>>>>>>>>>>>>>> users to diagnose and troubleshoot the behavior of
> > >>>>>>>>>>>>>>>>>>>>>>> auto.offset.reset.latest.max.age.
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>> Best,
> > >>>>>>>>>>>>>>>>>>>>>>> Chia-Ping
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>> On 2026/04/28 10:47:58 黃竣陽 wrote:
> > >>>>>>>>>>>>>>>>>>>>>>>> Hello everyone,
> > >>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>> I would like to start a discussion on KIP-1327: Prevent
> > >>>>>>>>>>>>>>>>>>>>>>>> Hot Data Loss on Partition Expansion for Latest Policy
> > >>>>>>>>>>>>>>>>>>>>>>>> <https://cwiki.apache.org/confluence/x/KY4mGQ>
> > >>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>> This proposal aims to introduce
> > >>>>>>>>>>>>>>>>>>>>>>>> auto.offset.reset.latest.max.age, a consumer config
> > >>>>>>>>>>>>>>>>>>>>>>>> that lets the latest reset policy distinguish newly
> > >>>>>>>>>>>>>>>>>>>>>>>> expanded (hot) partitions from long-existing (cold)
> > >>>>>>>>>>>>>>>>>>>>>>>> ones. Partitions younger than the configured threshold
> > >>>>>>>>>>>>>>>>>>>>>>>> automatically fall back to earliest, preventing silent
> > >>>>>>>>>>>>>>>>>>>>>>>> data loss during topic expansion without forcing a
> > >>>>>>>>>>>>>>>>>>>>>>>> full historical reprocess.
> > >>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>> Best regards,
> > >>>>>>>>>>>>>>>>>>>>>>>> Jiunn-Yang
