Maybe I misunderstand the proposal, but it sounds like an "irresponsible" consumer could accidentally delete data that other consumers have not yet consumed?
On Thu, Aug 28, 2014 at 10:06 AM, Prunier, Dominique <dominique.prun...@emc.com> wrote:
> Jay,
>
> I understand perfectly. I think you have all the reasons in the world to keep
> the broker truly consumer independent; in my view, that is a very wise
> principle that differentiates Kafka from pretty much all the other solutions.
>
> That is why, instead of the idea of a consumer-sensitive topic as a feature of
> the broker, I now prefer this to be the responsibility of the consumer(s).
> Therefore, simply exposing a remote call to expire a partition at a given
> offset would enable consumers to discard data by offset, most likely at the
> same time they commit offsets. It sounds to me simpler (as it keeps the
> broker pretty much as is) and cleaner (as it maintains the current design
> principles), while giving client applications the flexibility to choose how
> they want to handle data expiration.
>
> Thanks,
>
> -----Original Message-----
> From: Jay Kreps [mailto:jay.kr...@gmail.com]
> Sent: Thursday, August 28, 2014 12:28 PM
> To: users@kafka.apache.org
> Subject: Re: Consumer sensitive expiration of topic
>
> Hey Dominique,
>
> What you describe makes sense, and it would certainly be possible for
> the broker to more aggressively discard data once it sees that the
> consumer has read it once.
>
> The reason we haven't really taken that as a priority is that modern
> drives are so large relative to their throughput that discard is not
> usually pressing. Practically speaking, let's say you have a single
> cheap 2TB SATA drive and that you are doing 50k 1k messages per second
> across all topics on that machine (~50MB/sec). In this case you have
>   2*1024*1024*1024*1024 / (50000 * 1024) / 60 / 60 ≈ 12 hours of retention
> So even under very high load, optimizing discard is not a very pressing
> concern.
>
> That said, this would not be a terrible feature to have.
>
> -Jay
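The remote call proposed above, expiring a partition at a given offset, eventually landed in Kafka as the deleteRecords API (KIP-107, Kafka 0.11). A minimal sketch against the modern Java AdminClient follows; nothing like it existed when this thread was written, and the broker address, topic name, and offset are illustrative:

// Sketch only: Admin#deleteRecords (KIP-107) stands in for the call
// proposed in this thread. Broker address, topic, and offset are made up.
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.RecordsToDelete;
import org.apache.kafka.common.TopicPartition;

import java.util.Map;
import java.util.Properties;

public class ExpireByOffset {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker

        try (AdminClient admin = AdminClient.create(props)) {
            // Ask the broker to advance the log start offset of
            // topic_a_1 / partition 0 to 42000; segments entirely below it
            // become eligible for deletion regardless of time/size retention.
            admin.deleteRecords(Map.of(
                    new TopicPartition("topic_a_1", 0),
                    RecordsToDelete.beforeOffset(42_000L)
            )).all().get();
        }
    }
}

Note that this is exactly the power the reply at the top of the thread worries about: any client authorized to delete on the topic can discard data for every consumer, so the broker gates the call behind the topic's Delete ACL.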
> On Thu, Aug 28, 2014 at 8:03 AM, Prunier, Dominique
> <dominique.prun...@emc.com> wrote:
>> Yeah, I'm really not worried about performance. Disk space, or more
>> specifically, disk space consumed by duplicating the same data across
>> different topics, was my concern. The primary use case would be a special
>> consumer whose job would be to partition the messages from a topic into
>> various "private consumer topics" (without altering it) to provide a
>> filtered subscription service (e.g. for a remote service on a slower
>> network which cannot afford to receive the whole bunch of data and only
>> wants a subset of it).
>>
>> Do you think it would make sense to have a remote API call that manually
>> expires some partition segments by offset (as opposed to by time and/or
>> size)? For example, exposing cleanupLogs with additional parameters to
>> clean up segments on demand? I think it would be more than enough for me
>> and could be used for various other things, like manually truncating a
>> topic whose data isn't relevant anymore without recreating it.
>>
>> Thanks,
>>
>> -----Original Message-----
>> From: Neha Narkhede [mailto:neha.narkh...@gmail.com]
>> Sent: Wednesday, August 27, 2014 11:36 PM
>> To: users@kafka.apache.org
>> Subject: Re: Consumer sensitive expiration of topic
>>
>> Kafka is designed to maintain a persistent backlog of data on disk
>> efficiently and at scale. Unlike other messaging systems, doing so does
>> not affect the performance of the system. If you are worried about the
>> messages occupying disk space, you can always lower the retention on the
>> topic, as long as it stays higher than any lag your consumer can accrue.
>> The best approach here is to plan for allocating disk space for that
>> retention.
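A sketch of the per-topic retention override Neha suggests. At the time this was done with bin/kafka-topics.sh --alter --topic topic_a_1 --config retention.ms=...; the version below uses the modern Java AdminClient to keep the examples in one language. The topic name comes from the thread, and the one-hour value is an assumption: it only needs to stay above the worst lag the consumer can accrue.

// Sketch only: lowers topic_a_1's retention to ~1 hour (illustrative value).
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class LowerRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic =
                    new ConfigResource(ConfigResource.Type.TOPIC, "topic_a_1");
            AlterConfigOp setRetention = new AlterConfigOp(
                    new ConfigEntry("retention.ms", "3600000"), // keep ~1 hour
                    AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(
                    Map.of(topic, List.of(setRetention))).all().get();
        }
    }
}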
>>
>> On Mon, Aug 25, 2014 at 2:25 PM, Prunier, Dominique <
>> dominique.prun...@emc.com> wrote:
>>
>>> Any idea on this use case, guys?
>>>
>>> Thanks,
>>>
>>> -----Original Message-----
>>> From: Prunier, Dominique [mailto:dominique.prun...@emc.com]
>>> Sent: Friday, August 15, 2014 11:02 AM
>>> To: users@kafka.apache.org
>>> Subject: RE: Consumer sensitive expiration of topic
>>>
>>> Hi,
>>>
>>> Thanks for the answer.
>>>
>>> The topics themselves won't be short-lived (as their consumers are
>>> supposed to stay there); the messages in them will. What I'm trying to
>>> achieve is something similar to this:
>>>
>>> Producers --<topic>--> Processor A0 --<topic_a_1>--> Processor A1 --<topic_a_2>--> ... --<topic_a_N>--> Consumer
>>>                   |--> Processor B0 --<topic_b_1>--> Processor B1 --<topic_b_2>--> ... --<topic_b_N>--> Consumer
>>>                   |--> Processor C0 --<topic_c_1>--> Processor C1 --<topic_c_2>--> ... --<topic_c_N>--> Consumer
>>>
>>> Essentially, the "main" topic is the first one and the only one consumed
>>> by multiple processors/consumers. Each processor knows which processor
>>> should receive its output by knowing that processor's "private" topic
>>> name. So in this example, once Processor A1 picks a message from
>>> topic_a_1 and commits the offset, the message won't be used by anyone
>>> else anymore.
>>>
>>> There is no particular issue with just leaving this as is, but topic_a_1
>>> is going to buffer quite a lot of data on disk while, essentially, the
>>> only thing we have to deal with here is Processor A1 going down or
>>> lagging. When Processor A1 is healthy, the expiration of topic_a_1 could
>>> be kept very low, avoiding a fair amount of resource use.
>>>
>>> An idea off the top of my head would be an API where you can manually
>>> set the expiration of a topic by specifying offsets for its partitions.
>>> This way, once Processor A1 has consumed its messages, it could not only
>>> commit the offsets (which, as far as I understand, has nothing to do
>>> with the broker itself) but also set the expiration of the topic using
>>> the same offsets (which could be done less frequently).
>>>
>>> Does that make sense?
>>>
>>> Thanks,
>>>
>>> -----Original Message-----
>>> From: Neha Narkhede [mailto:neha.narkh...@gmail.com]
>>> Sent: Thursday, August 14, 2014 8:10 PM
>>> To: users@kafka.apache.org
>>> Subject: Re: Consumer sensitive expiration of topic
>>>
>>> By design, Kafka stores data independent of the number of publishers or
>>> subscribers connected to it. This provides high performance, as the
>>> broker does not have to manage consumers and evict data based on each
>>> consumer's position. This is one of the main reasons why Kafka is much
>>> more performant than JMS queues.
>>>
>>> It seems like your use case requires the concept of ephemeral topics,
>>> where you would like to auto-delete a topic once a particular consumer
>>> group has finished consuming data from it. Once 0.8.2 is released with
>>> delete-topic support, we intend to add auto-expiration, which will
>>> delete topics that have not been accessed in some configurable amount
>>> of time.
>>>
>>> Is there a reason why your application needs to create such short-lived
>>> topics?
>>>
>>> Thanks,
>>> Neha
>>>
>>>
>>> On Thu, Aug 14, 2014 at 2:56 PM, Prunier, Dominique <
>>> dominique.prun...@emc.com> wrote:
>>>
>>> > Hi,
>>> >
>>> > I'm playing around with Kafka with the idea of implementing a general
>>> > purpose message exchanger for a distributed application with high
>>> > throughput requirements (several hundred thousand messages per second).
>>> >
>>> > In this context, I would like to be able to use a topic as some form of
>>> > private mailbox for a single consumer group. In this situation, once the
>>> > single consumer group has committed its offset on its private topic, the
>>> > messages there won't be used anymore and can be safely discarded.
>>> > Therefore, I was wondering if you'd see a way (in the current release or
>>> > in the future) to have a topic whose expiration policy is based on
>>> > consumer offsets.
>>> >
>>> > Thanks,
>>> >
>>> > --
>>> > Dominique Prunier
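Putting the pieces of the thread together, here is a minimal sketch of the consume, commit, then expire-by-offset loop Dominique describes for Processor A1. It is written against today's Java clients (neither KafkaConsumer nor AdminClient existed in this form in 2014); topic_a_1 comes from the thread, the group id and the one-minute expiry cadence are assumptions reflecting "which could be done less frequently".

// Sketch only: modern clients stand in for the proposed expire-by-offset
// API; the group id and 60s expiry cadence are illustrative.
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.RecordsToDelete;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

import java.time.Duration;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class ProcessorA1 {
    public static void main(String[] args) throws Exception {
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092"); // assumed
        consumerProps.put("group.id", "processor-a1");            // assumed
        consumerProps.put("enable.auto.commit", "false");
        consumerProps.put("key.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        consumerProps.put("value.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        Properties adminProps = new Properties();
        adminProps.put("bootstrap.servers", "localhost:9092");    // assumed

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(consumerProps);
             AdminClient admin = AdminClient.create(adminProps)) {
            consumer.subscribe(Collections.singletonList("topic_a_1"));
            long lastExpiry = System.currentTimeMillis();

            while (true) {
                ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<byte[], byte[]> record : records) {
                    process(record); // e.g. re-publish to topic_a_2
                }
                consumer.commitSync(); // commit the consumed offsets

                // Less frequently, expire everything below the committed
                // position: the "set the expiration using the same offsets" step.
                if (System.currentTimeMillis() - lastExpiry > 60_000) {
                    Map<TopicPartition, RecordsToDelete> toDelete = new HashMap<>();
                    for (TopicPartition tp : consumer.assignment()) {
                        toDelete.put(tp, RecordsToDelete.beforeOffset(consumer.position(tp)));
                    }
                    admin.deleteRecords(toDelete).all().get();
                    lastExpiry = System.currentTimeMillis();
                }
            }
        }
    }

    private static void process(ConsumerRecord<byte[], byte[]> record) {
        // application-specific processing / forwarding
    }
}

Because deleteRecords affects every reader of the partition, this pattern is only safe on a topic owned by a single consumer group, which is precisely the "private mailbox" constraint the original message starts from and the risk the newest reply points out.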