Jay, I understand perfectly. I think you have every reason to keep the broker truly consumer-independent; in my view, that is a very wise principle, and one that differentiates Kafka from pretty much every other solution.

That is why, instead of making consumer-sensitive topics a feature of the broker, I now prefer this to be the responsibility of the consumer(s). Simply exposing a remote call to expire a partition at a given offset would let consumers discard data by offset, most likely at the same time they commit offsets. This sounds simpler to me (it keeps the broker pretty much as is) and cleaner (it preserves the current design principles), while offering client applications the flexibility to choose how they want to handle data expiration.

Thanks,
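[For illustration: Kafka did not have such a call at the time of this thread, but KIP-107 (Kafka 0.11) later added AdminClient#deleteRecords, which is essentially the API proposed above. A minimal sketch of the commit-then-trim pattern it enables; the broker address, group id and topic name are placeholders:]

import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.RecordsToDelete;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class CommitThenTrim {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");      // placeholder
        props.put("group.id", "processor-a1");                 // the single "private" consumer group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("enable.auto.commit", "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             Admin admin = Admin.create(props)) {
            consumer.subscribe(List.of("topic_a_1"));          // hypothetical private topic
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));

            // For each partition, remember the offset just past the last record consumed.
            Map<TopicPartition, RecordsToDelete> trim = new HashMap<>();
            for (ConsumerRecord<String, String> record : records) {
                trim.put(new TopicPartition(record.topic(), record.partition()),
                         RecordsToDelete.beforeOffset(record.offset() + 1));
            }

            // Commit the consumed positions first, then ask the broker to drop
            // everything below them: it advances the partitions' log start offset
            // and eventually reclaims the disk space.
            consumer.commitSync();
            if (!trim.isEmpty()) {
                admin.deleteRecords(trim).all().get();
            }
        }
    }
}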
-----Original Message-----
From: Jay Kreps [mailto:jay.kr...@gmail.com]
Sent: Thursday, August 28, 2014 12:28 PM
To: users@kafka.apache.org
Subject: Re: Consumer sensitive expiration of topic

Hey Dominique,

What you describe makes sense, and it would certainly be possible for the
broker to discard data more aggressively once it sees that the consumer has
read it. The reason we haven't really taken that as a priority is that
modern drives are so large relative to their throughput that discarding
early is not usually pressing.

Practically speaking, let's say you have a single cheap 2TB SATA drive and
that you are doing 50k 1k messages per second across all topics on that
machine (~50MB/sec). In this case you have

  2*1024*1024*1024*1024 / (50000 * 1024) / 60 / 60 ≈ 12 hours of retention

So even under very high load, optimizing discard is not a very pressing
concern. That said, this would not be a terrible feature to have.

-Jay

On Thu, Aug 28, 2014 at 8:03 AM, Prunier, Dominique <
dominique.prun...@emc.com> wrote:

> Yeah, I'm really not worried about performance. Disk space, or more
> specifically, disk space consumed by duplicating the same data in
> different topics, was my concern. The primary use case would be a special
> consumer whose job is to partition the messages from a topic into various
> "private consumer topics" (without altering the source topic) to provide
> a filtered subscription service (e.g. for a remote service on a slower
> network which cannot afford to receive the whole stream and only wants a
> subset of it).
>
> Do you think it would make sense to have a remote API call that manually
> expires some partition segments by offset (as opposed to time and/or
> size)? For example, exposing cleanupLogs with additional parameters to
> clean up segments on demand? I think that would be more than enough for
> me, and it could be used for various other things, like manually
> truncating a topic whose data isn't relevant anymore without recreating
> it.
>
> Thanks,
>
> -----Original Message-----
> From: Neha Narkhede [mailto:neha.narkh...@gmail.com]
> Sent: Wednesday, August 27, 2014 11:36 PM
> To: users@kafka.apache.org
> Subject: Re: Consumer sensitive expiration of topic
>
> Kafka is designed to maintain a persistent backlog of data on disk
> efficiently and at scale. Unlike other messaging systems, doing so does
> not affect the performance of the system. If you are worried about the
> messages occupying disk space, you can always set a lower retention on
> the topic, one that is still higher than any lag your consumer can
> accrue. The best plan here would be to allocate disk space for that
> retention.
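[For illustration: the per-topic retention Neha describes is the topic-level retention.ms config. A minimal sketch of setting it with the modern Admin API; the one-hour value and topic name are only examples, and the value must stay above the worst lag the consumer can accrue:]

import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class SetTopicRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder

        try (Admin admin = Admin.create(props)) {
            // Keep at most one hour of data on "topic_a_1" (hypothetical name).
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "topic_a_1");
            AlterConfigOp setRetention = new AlterConfigOp(
                    new ConfigEntry("retention.ms", "3600000"), AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(Map.of(topic, List.of(setRetention))).all().get();
        }
    }
}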
>
> On Mon, Aug 25, 2014 at 2:25 PM, Prunier, Dominique <
> dominique.prun...@emc.com> wrote:
>
>> Any idea on this use case, guys?
>>
>> Thanks,
>>
>> -----Original Message-----
>> From: Prunier, Dominique [mailto:dominique.prun...@emc.com]
>> Sent: Friday, August 15, 2014 11:02 AM
>> To: users@kafka.apache.org
>> Subject: RE: Consumer sensitive expiration of topic
>>
>> Hi,
>>
>> Thanks for the answer.
>>
>> The topics themselves won't be short-lived (their consumers are supposed
>> to stay there); the messages in them will. What I'm trying to achieve is
>> something similar to this:
>>
>> Producers --<topic>--> Processor A0 --<topic_a_1>--> Processor A1 --<topic_a_2>--> ... --<topic_a_N>--> Consumer
>>                   |--> Processor B0 --<topic_b_1>--> Processor B1 --<topic_b_2>--> ... --<topic_b_N>--> Consumer
>>                   |--> Processor C0 --<topic_c_1>--> Processor C1 --<topic_c_2>--> ... --<topic_c_N>--> Consumer
>>
>> Essentially, the "main" topic is the first one and the only one consumed
>> by multiple processors/consumers. Each processor knows which processor
>> comes next by knowing that processor's "private" topic name. So in this
>> example, once Processor A1 picks a message from topic_a_1 and commits
>> the offset, the message won't be used by anyone else anymore.
>>
>> There is no particular issue with just leaving this as is, but topic_a_1
>> is going to buffer quite a lot of data on disk while, essentially, the
>> only thing we have to deal with here is Processor A1 going down or
>> lagging. When Processor A1 is healthy, the retention of topic_a_1 could
>> be kept very low, avoiding a fair amount of resource use.
>>
>> An idea off the top of my head would be an API where you can manually
>> set the expiration of a topic by specifying offsets for its partitions.
>> This way, once Processor A1 has consumed its messages, it could not only
>> commit the offsets (which, as far as I understand, has nothing to do
>> with the broker itself) but also set the expiration of the topic using
>> the same offsets (which could be done less frequently).
>>
>> Does it make sense?
>>
>> Thanks,
>>
>> -----Original Message-----
>> From: Neha Narkhede [mailto:neha.narkh...@gmail.com]
>> Sent: Thursday, August 14, 2014 8:10 PM
>> To: users@kafka.apache.org
>> Subject: Re: Consumer sensitive expiration of topic
>>
>> By design, Kafka stores data independently of the number of publishers
>> or subscribers connecting to it. This provides high performance, as the
>> broker does not have to manage consumers and evict data based on each
>> consumer's position. This is one of the main reasons why Kafka is much
>> more performant than JMS queues.
>>
>> It seems like your use case requires the concept of ephemeral topics,
>> where you would like to auto-delete a topic once a particular consumer
>> group has finished consuming data from it. Once 0.8.2 is released with
>> delete-topic support, we intend to add auto-expiration of topics, which
>> will delete topics that have not been accessed in some configurable
>> time.
>>
>> Is there a reason why your application needs to create such short-lived
>> topics?
>>
>> Thanks,
>> Neha
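[For illustration: the delete-topic support Neha mentions did ship with 0.8.2; with today's Admin API, dropping a drained private topic programmatically looks roughly like this. The topic name is again a placeholder, and delete.topic.enable must be true on the brokers:]

import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;

public class DropPrivateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder

        try (Admin admin = Admin.create(props)) {
            // Delete the hypothetical private topic once its single consumer
            // group has fully drained it.
            admin.deleteTopics(List.of("topic_a_1")).all().get();
        }
    }
}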
>>
>> On Thu, Aug 14, 2014 at 2:56 PM, Prunier, Dominique <
>> dominique.prun...@emc.com> wrote:
>>
>> > Hi,
>> >
>> > I'm playing around with Kafka with the idea of implementing a
>> > general-purpose message exchanger for a distributed application with
>> > high throughput requirements (multiple hundred thousand messages per
>> > second).
>> >
>> > In this context, I would like to be able to use a topic as some form
>> > of private mailbox for a single consumer group. In this situation,
>> > once the single consumer group has committed its offset on its private
>> > topic, the messages there won't be used anymore and can be safely
>> > discarded. Therefore, I was wondering if you'd see a way (in the
>> > current release or in the future) to have a topic whose expiration
>> > policy is based on consumer offsets.
>> >
>> > Thanks,
>> >
>> > --
>> > Dominique Prunier
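[For illustration: the 'private mailbox' described above is just a topic read by a single consumer group with manual offset commits; a minimal sketch under that assumption, with broker address, group id and topic name purely illustrative:]

import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PrivateMailbox {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder
        props.put("group.id", "mailbox-owner");             // the one and only consumer group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("enable.auto.commit", "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("mailbox-topic"));   // hypothetical private topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // once committed below, no one reads this message again
                }
                consumer.commitSync();
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.printf("%s-%d@%d: %s%n",
                record.topic(), record.partition(), record.offset(), record.value());
    }
}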