On Tue, Jan 3, 2017 at 5:30 PM, radai <radai.rosenbl...@gmail.com> wrote:
> also 4. some apps may do their own offset bookkeeping

This is definitely a fair point, but if you want aggressive cleanup of data
in Kafka, you can dual commit, with the Kafka commit happening second. I
don't see how this would be a problem: inconsistency isn't an issue, since
"late" commits to Kafka would only affect how quickly data is cleaned up. If
we miss the offset commit to Kafka after committing offsets to some other
system, we'd just delay deleting data for a short time. (A great example of
taking advantage of this would be the HDFS connector for Kafka Connect,
which manages its own offsets, but where users might like to clean up data
more aggressively once it has landed in HDFS. I'd love to see support for
this integrated in the HDFS connector.) I don't think the proposed approach
is a bad idea; I just want to understand the space of design options and
their tradeoffs.
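To make the ordering concrete, here is a minimal sketch of the dual-commit
pattern. The OffsetStore interface and its commit() method are hypothetical
placeholders for whatever system actually owns the offsets (HDFS, a
database, etc.); only the consumer calls are real client API. Because the
Kafka commit happens second, Kafka's committed offsets can never run ahead
of the external system's, so consumed-based cleanup could never delete data
that hasn't safely landed externally:

    import java.time.Duration;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.kafka.clients.consumer.Consumer;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.TopicPartition;

    public class DualCommit {

        /** Placeholder for whatever system actually owns the offsets. */
        interface OffsetStore {
            void commit(Map<TopicPartition, OffsetAndMetadata> offsets);
        }

        static void pollAndDualCommit(Consumer<byte[], byte[]> consumer,
                                      OffsetStore store) {
            ConsumerRecords<byte[], byte[]> records =
                consumer.poll(Duration.ofMillis(500));
            if (records.isEmpty())
                return;

            // ... process records / land the data in the external system ...

            Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
            for (TopicPartition tp : records.partitions()) {
                List<ConsumerRecord<byte[], byte[]>> rs = records.records(tp);
                // The committed offset is the offset of the next record to read.
                offsets.put(tp,
                    new OffsetAndMetadata(rs.get(rs.size() - 1).offset() + 1));
            }

            // 1) Commit to the system of record first.
            store.commit(offsets);
            // 2) Then commit to Kafka. If this commit is lost, cleanup of
            //    already-consumed data is merely delayed, never unsafe.
            consumer.commitSync(offsets);
        }
    }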
> On Tue, Jan 3, 2017 at 5:29 PM, radai <radai.rosenbl...@gmail.com> wrote:
> > the issue with tracking committed offsets is whose offsets do you track?
> >
> > 1. some topics have multiple groups

Couldn't this go into the topic-level config? This is why I mentioned one
group vs. multiple groups in my earlier reply. A single group keeps things
simple with respect to deciding when log segments can be deleted, and would
easily fit into a topic-level config (I think it doesn't require additional
state in memory, despite requiring consuming all __consumer_offsets
partitions); multiple groups complicate both how the config is specified and
how the state needed to decide whether a log segment can be deleted is
tracked. That said, I don't see a fundamental reason we couldn't support
multiple consumer groups per topic.

> > 2. some "groups" are really one-offs like developers spinning up console
> > consumer "just to see if there's data"

This seems very counter to the motivating use case in the KIP for
intermediate stream processing topics? The stated use case is for stream
processing apps where, presumably, there would be a single, fixed,
deterministically named consumer for the data.

> > 3. there are use cases where you want to deliberately "wipe" data EVEN IF
> > it's still being consumed

What are these use cases? Can we get them enumerated in the KIP so we
understand the use cases and conditions where this would happen? What are
the cases that wouldn't be covered by existing retention policies? The only
new type of policy proposed so far is based on whether data has been
consumed or not; is there something new besides a) time-based, b)
size-based, or c) consumed-based?

> > #1 is a configuration mess, since there are multiple possible strategies.
> > #2 is problematic without a definition of "liveness" or special handling
> > for console consumers, and #3 is flat out impossible with
> > committed-offset tracking.
> >
> > On Tue, Jan 3, 2017 at 3:56 PM, Ewen Cheslack-Postava <e...@confluent.io>
> > wrote:
> >
> >> Dong,
> >>
> >> Looks like that's an internal link,
> >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-107%3A+Add+purgeDataBefore%28%29+API+in+AdminClient
> >> is the right one.
> >>
> >> I have a question about one of the rejected alternatives:
> >>
> >> > Using committed offset instead of an extra API to trigger data purge
> >> > operation.
> >>
> >> The KIP says this would be more complicated to implement. Why is that? I
> >> think brokers would have to consume the entire offsets topic, but the
> >> data stored in memory doesn't seem to change, and applying this when
> >> updated offsets are seen seems basically the same. It might also be
> >> possible to make it work even with multiple consumer groups, if that was
> >> desired, as a generalization without requiring coordination between the
> >> consumer groups (although that'd require tracking more data in memory).
> >> Given the motivation, I'm assuming this was considered unnecessary since
> >> this specifically targets intermediate stream processing topics.
> >>
> >> Another question is why expose this via AdminClient (which isn't public
> >> API afaik)? Why not, for example, expose it on the Consumer, which is
> >> presumably where you'd want access to it, since the functionality
> >> depends on the consumer actually having consumed the data?
> >>
> >> -Ewen
> >>
> >> On Tue, Jan 3, 2017 at 2:45 PM, Dong Lin <lindon...@gmail.com> wrote:
> >>
> >> > Hi all,
> >> >
> >> > We created KIP-107 to propose addition of purgeDataBefore() API in
> >> > AdminClient.
> >> >
> >> > Please find the KIP wiki in the link
> >> > https://iwww.corp.linkedin.com/wiki/cf/display/ENGS/Kafka+purgeDataBefore%28%29+API+design+proposal.
> >> > We would love to hear your comments and suggestions.
> >> >
> >> > Thanks,
> >> > Dong
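P.S. For concreteness, a hypothetical sketch of what a purgeDataBefore()
call might look like from client code. Only the method name comes from the
KIP title; the parameter type here (a map from partition to the first offset
to retain) and the receiver are guesses for illustration, and the real
signature is whatever the KIP ends up defining:

    // Hypothetical usage; signature and types are assumptions, not the
    // KIP's actual definition.
    Map<TopicPartition, Long> purgeBefore = new HashMap<>();
    purgeBefore.put(new TopicPartition("intermediate-topic", 0), 42000L);
    // Ask the cluster to delete all records in this partition with
    // offset < 42000.
    adminClient.purgeDataBefore(purgeBefore);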