Hello Matthias,

Thank you for the feedback.

I really like the proposal to change state.cleanup.on.start from a boolean
to a long (with a default of -1). Do we need to change the naming then?
Proposal: state.cleanup.on.start.delay.ms.
Decoupling this from state.cleanup.delay.ms ensures the new feature doesn't
have unintended side effects. It also gives users the flexibility to align
the cleanup threshold with their delete.retention.ms settings. For example,
if the retention is set to 24 hours, a user could safely set the cleanup
property to 20 hours (or even closer to the retention value).

Regarding the global store case, I believe this approach helps there as
well. Even if a less frequently updated global store is wiped, that would
only happen once the store's age exceeds the threshold the user has
explicitly configured, which is a manageable trade-off.
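To make the trade-off concrete, here is a rough sketch of the kind of age
check I have in mind (this is not the actual StateDirectory code, and the
directory path below is just an example):

import java.io.File;

// thresholdMs comes from the proposed state.cleanup.on.start.delay.ms;
// a negative value keeps the startup cleanup disabled
long thresholdMs = 20L * 60 * 60 * 1000; // 20 hours
File storeDir = new File("/var/kafka-streams/orders-app/global");
long ageMs = System.currentTimeMillis() - storeDir.lastModified();
if (thresholdMs >= 0 && ageMs > thresholdMs) {
    // wipe the directory and its checkpoint file before startup; the store
    // is then rebuilt from its source topic during restoration
}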

I have updated the KIP accordingly.

Best regards,
Uladzislau Blok

On Tue, Jan 27, 2026 at 8:19 AM Matthias J. Sax <[email protected]> wrote:

> Thanks for raising both points.
>
> The global store one is tricky. Not sure atm. The good thing is, of
> course, that this new feature is disabled by default. Maybe it would be
> sufficient to call out this edge case in the docs explicitly, calling
> for caution, but leave it up to the user to decide? -- Maybe others have
> some ideas?
>
>
> About increasing `state.cleanup.delay.ms` -- I am not convinced it would
> be a good idea. I would propose two alternatives.
>
>   - extend the doc to tell users to consider increasing this config, if
> they use this new feature
>
>   - change `state.cleanup.on.start` from a boolean to a long, with
> default value `-1` (for disabled) and let users decide what age
> threshold they want to apply when enabling the feature, effectively
> decoupling the new feature from `state.cleanup.delay.ms` config.
>
> Thoughts?
>
>
> -Matthias
>
> On 1/18/26 11:01 AM, Uladzislau Blok wrote:
> > Hello Matthias,
> >
> > Thanks for the feedback on the KIP.
> >
> > It seems we had a slight misunderstanding regarding the cleanup logic,
> > but after revisiting the ticket and the existing codebase, your
> > suggestion to wipe stores older than state.cleanup.delay.ms makes
> > perfect sense. I have updated the KIP accordingly, and it is now ready
> > for a second round of review.
> >
> > I would like to highlight two specific points for further discussion:
> >
> >     - This proposal might cause global stores to be deleted if they
> >       aren't updated often. Currently, we check the last modification
> >       time of the directory. If a global table hasn't changed, it might
> >       be cleaned up even if the data is still valid. However, since
> >       these tables are usually small, this might not be a major issue.
> >       What do you think?
> >
> >     - We previously discussed increasing the default value for
> >       state.cleanup.delay.ms to be less aggressive. Do we have any
> >       consensus on a reasonable default, or a recommended methodology
> >       for measuring what this value should be?
> >
> > Regards,
> > Uladzislau Blok.
> >
> > On Mon, Jan 12, 2026 at 2:55 AM Matthias J. Sax <[email protected]>
> wrote:
> >
> >> Thanks for the KIP Uladzislau.
> >>
> >> Given that you propose to wipe the entire state if this config is set, I
> >> am wondering if we would need such a config to begin with, or if users
> >> could implement this themselves (via some custom config the application
> >> code uses) and calls `KafkaStreams#cleanUp()` to wipe out all local
> >> state if this custom config is set?
> >>
> >> I believe to remember from the original ticket discussion, that the idea
> >> was not to blindly wipe the entire state, but to do it still based on
> >> task directory age, similar to what the background cleaner thread does
> >> (based on `state.cleanup.delay.ms` config). And to trigger a cleanup
> >> run before startup. Thoughts?
> >>
> >>
> >> -Matthias
> >>
> >> On 12/21/25 6:37 AM, Uladzislau Blok wrote:
> >>> Hi everyone,
> >>>
> >>> I'd like to start a discussion on *KIP-1259: Add configuration to wipe
> >>> local state on startup*.
> >>> Problem
> >>>
> >>> Currently, Kafka Streams can encounter a "zombie data" issue when an
> >>> instance restarts using stale local files after a period exceeding the
> >>> changelog topic's delete.retention.ms. If the local checkpoint offset
> >>> is still within the broker's available log range (due to long-lived
> >>> entities), an automatic reset isn't triggered. However, since the
> >>> broker has already purged deletion tombstones, the state store is
> >>> rehydrated without the "delete" instructions, causing previously
> >>> deleted entities to unexpectedly reappear in the local RocksDB.
> >>> Proposed Solution
> >>>
> >>> I propose introducing a new configuration, state.cleanup.on.start
> >>> (Boolean, default: false). When enabled, this property forces the
> >>> deletion of all local state directories and checkpoint files during
> >>> application initialization. This ensures the state is rebuilt entirely
> >>> from the changelog—the broker's "source of truth"—effectively purging
> >>> any expired zombie records.
> >>>
> >>> This is particularly useful for environments with persistent volumes
> >>> where instances might remain dormant for long periods (e.g.,
> >>> multi-region failover).
> >>>
> >>> *KIP Link: *
> >>>
> >>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1259%3A+Add+configuration+to+wipe+Kafka+Streams+local+state+on+startup
> >>>
> >>>
> >>> I look forward to your feedback and suggestions.
> >>>
> >>>
> >>> Best regards,
> >>>
> >>> Uladzislau Blok
> >>>
> >>
> >>
> >
>
>
