On Tue, Jan 3, 2017 at 5:30 PM, radai <radai.rosenbl...@gmail.com> wrote:
> also 4. some apps may do their own offset bookkeeping

This is definitely a fair point, but if you want aggressive cleanup of data
in Kafka, you can dual commit, with the Kafka commit happening second. I
don't see how this would be a problem: inconsistency isn't an issue, since
"late" commits to Kafka would only affect how quickly data is cleaned up. If
we miss the offset commit to Kafka after committing offsets to some other
system, we'd just delay deleting data for a short time. (A great example of
taking advantage of this would be the HDFS connector for Kafka Connect,
which manages its own offsets, but where users might like to clean up data
more aggressively once it has landed in HDFS. I'd love to see support for
this integrated in the HDFS connector.) I don't think the proposed approach
is a bad idea; I just want to understand the space of design options and
their tradeoffs.
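To make the ordering concrete, here is a minimal sketch of the dual-commit
pattern. The OffsetStore interface and its commit() method are hypothetical
placeholders for whatever system actually owns the offsets (HDFS, a
database, etc.); only the consumer calls are real client API. Because the
Kafka commit happens second, Kafka's committed offsets can never run ahead
of the external system's, so consumed-based cleanup could never delete data
that hasn't safely landed externally:

    import java.time.Duration;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.kafka.clients.consumer.Consumer;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.TopicPartition;

    public class DualCommit {

        /** Placeholder for whatever system actually owns the offsets. */
        interface OffsetStore {
            void commit(Map<TopicPartition, OffsetAndMetadata> offsets);
        }

        static void pollAndDualCommit(Consumer<byte[], byte[]> consumer,
                                      OffsetStore store) {
            ConsumerRecords<byte[], byte[]> records =
                consumer.poll(Duration.ofMillis(500));
            if (records.isEmpty())
                return;

            // ... process records / land the data in the external system ...

            Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
            for (TopicPartition tp : records.partitions()) {
                List<ConsumerRecord<byte[], byte[]>> rs = records.records(tp);
                // The committed offset is the offset of the next record to read.
                offsets.put(tp,
                    new OffsetAndMetadata(rs.get(rs.size() - 1).offset() + 1));
            }

            // 1) Commit to the system of record first.
            store.commit(offsets);
            // 2) Then commit to Kafka. If this commit is lost, cleanup of
            //    already-consumed data is merely delayed, never unsafe.
            consumer.commitSync(offsets);
        }
    }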
> On Tue, Jan 3, 2017 at 5:29 PM, radai <radai.rosenbl...@gmail.com> wrote:
> > the issue with tracking committed offsets is whose offsets do you track?
> >
> > 1. some topics have multiple groups

Couldn't this go into the topic-level config? This is why I mentioned one
group vs. multiple groups in my earlier reply. A single group keeps things
simple with respect to deciding when log segments can be deleted, and would
easily fit into a topic-level config (I think it doesn't require additional
state in memory, despite requiring consuming all __consumer_offsets
partitions); multiple groups complicate both how the config is specified and
how the state needed to decide whether a log segment can be deleted is
tracked. That said, I don't see a fundamental reason we couldn't support
multiple consumer groups per topic.

> > 2. some "groups" are really one-offs like developers spinning up console
> > consumer "just to see if there's data"

This seems very counter to the motivating use case in the KIP for
intermediate stream processing topics? The stated use case is for stream
processing apps where, presumably, there would be a single, fixed,
deterministically named consumer for the data.

> > 3. there are use cases where you want to deliberately "wipe" data EVEN IF
> > it's still being consumed

What are these use cases? Can we get them enumerated in the KIP so we
understand the use cases and conditions where this would happen? What are
the cases that wouldn't be covered by existing retention policies? The only
new type of policy proposed so far is based on whether data has been
consumed or not; is there something new besides a) time-based, b)
size-based, or c) consumed-based?

> > #1 is a configuration mess, since there are multiple possible strategies.
> > #2 is problematic without a definition of "liveness" or special handling
> > for console consumers, and #3 is flat out impossible with
> > committed-offset tracking.
> >
> > On Tue, Jan 3, 2017 at 3:56 PM, Ewen Cheslack-Postava <e...@confluent.io>
> > wrote:
> >
> >> Dong,
> >>
> >> Looks like that's an internal link,
> >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-107%3A+Add+purgeDataBefore%28%29+API+in+AdminClient
> >> is the right one.
> >>
> >> I have a question about one of the rejected alternatives:
> >>
> >> > Using committed offset instead of an extra API to trigger data purge
> >> > operation.
> >>
> >> The KIP says this would be more complicated to implement. Why is that? I
> >> think brokers would have to consume the entire offsets topic, but the
> >> data stored in memory doesn't seem to change, and applying this when
> >> updated offsets are seen seems basically the same. It might also be
> >> possible to make it work even with multiple consumer groups, if that was
> >> desired, as a generalization without requiring coordination between the
> >> consumer groups (although that'd require tracking more data in memory).
> >> Given the motivation, I'm assuming this was considered unnecessary since
> >> this specifically targets intermediate stream processing topics.
> >>
> >> Another question is why expose this via AdminClient (which isn't public
> >> API afaik)? Why not, for example, expose it on the Consumer, which is
> >> presumably where you'd want access to it, since the functionality
> >> depends on the consumer actually having consumed the data?
> >>
> >> -Ewen
> >>
> >> On Tue, Jan 3, 2017 at 2:45 PM, Dong Lin <lindon...@gmail.com> wrote:
> >>
> >> > Hi all,
> >> >
> >> > We created KIP-107 to propose addition of purgeDataBefore() API in
> >> > AdminClient.
> >> >
> >> > Please find the KIP wiki in the link
> >> > https://iwww.corp.linkedin.com/wiki/cf/display/ENGS/Kafka+purgeDataBefore%28%29+API+design+proposal.
> >> > We would love to hear your comments and suggestions.
> >> >
> >> > Thanks,
> >> > Dong
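P.S. For concreteness, a hypothetical sketch of what a purgeDataBefore()
call might look like from client code. Only the method name comes from the
KIP title; the parameter type here (a map from partition to the first offset
to retain) and the receiver are guesses for illustration, and the real
signature is whatever the KIP ends up defining:

    // Hypothetical usage; signature and types are assumptions, not the
    // KIP's actual definition.
    Map<TopicPartition, Long> purgeBefore = new HashMap<>();
    purgeBefore.put(new TopicPartition("intermediate-topic", 0), 42000L);
    // Ask the cluster to delete all records in this partition with
    // offset < 42000.
    adminClient.purgeDataBefore(purgeBefore);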