Maybe I misunderstand the proposal, but it sounds like an
"irresponsible" consumer could accidentally delete data that others
have not yet consumed?

On Thu, Aug 28, 2014 at 10:06 AM, Prunier, Dominique
<dominique.prun...@emc.com> wrote:
> Jay,
>
> I understand perfectly. I think you have all the reasons in the world to keep
> the broker truly consumer-independent, as that is, in my opinion, a very wise
> principle that differentiates Kafka from pretty much all the other solutions.
>
> That is why, instead of making consumer-sensitive topics a feature of the
> broker, I now prefer this to be the responsibility of the consumer(s).
> Simply exposing a remote call to expire a partition at a given offset would
> let consumers discard data by offset, most likely at the same time they
> commit offsets. This sounds simpler (it keeps the broker pretty much as is)
> and cleaner (it maintains the current design principles), while giving
> client applications the flexibility to choose how they handle data
> expiration.
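>
> To make the shape concrete, here is a rough sketch of the client side
> (purely hypothetical: none of these interfaces exist in Kafka today, and
> the names are made up):
>
>     // Hypothetical sketch only -- neither interface exists in Kafka.
>     interface OffsetCommitter {
>         void commit(String topic, int partition, long offset);
>     }
>
>     interface PartitionExpirer {
>         // "segments lying entirely below this offset may be discarded"
>         void expireUpTo(String topic, int partition, long offset);
>     }
>
>     class PrivateTopicConsumer {
>         void onBatchProcessed(OffsetCommitter committer, PartitionExpirer expirer,
>                               String topic, int partition, long lastProcessed) {
>             long next = lastProcessed + 1;              // next offset to consume
>             committer.commit(topic, partition, next);   // the usual commit path
>             expirer.expireUpTo(topic, partition, next); // the proposed new call;
>                                                         // could be issued less often
>         }
>     }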
>
>
> Thanks,
>
> -----Original Message-----
> From: Jay Kreps [mailto:jay.kr...@gmail.com]
> Sent: Thursday, August 28, 2014 12:28 PM
> To: users@kafka.apache.org
> Subject: Re: Consumer sensitive expiration of topic
>
> Hey Dominique,
>
> What you describe makes sense, and it would certainly be possible for
> the broker to more aggressively discard data once it sees that the
> consumer has read it once.
>
> The reason we haven't really prioritized this is that modern drives are so
> large relative to their throughput that discard is not usually pressing.
> Practically speaking, let's say you have a single cheap 2TB SATA drive and
> you are doing 50k 1KB messages per second across all topics on that machine
> (~50MB/sec). In this case you have
>    2*1024*1024*1024*1024 / (50000 * 1024) / 60 / 60 ≈ 12 hours of retention
> So even under very high load, optimizing discard is not a very pressing
> concern.
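>
> (Same arithmetic as a runnable snippet, if you want to plug in your own
> numbers; the figures are the assumptions above, not measurements:)
>
>     public class RetentionEstimate {
>         public static void main(String[] args) {
>             double diskBytes = 2.0 * 1024 * 1024 * 1024 * 1024; // 2TB drive
>             double bytesPerSec = 50000.0 * 1024;                // 50k 1KB msgs/sec
>             double hours = diskBytes / bytesPerSec / 60 / 60;
>             System.out.printf("~%.1f hours of retention%n", hours); // ~11.9
>         }
>     }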
>
> That said, this would not be a terrible feature to have.
>
> -Jay
>
> On Thu, Aug 28, 2014 at 8:03 AM, Prunier, Dominique
> <dominique.prun...@emc.com> wrote:
>> Yeah, I'm really not worried about performance. Disk space, or more
>> specifically the duplication of the same data across different topics, was
>> my concern. The primary use case would be a special consumer whose job is
>> to partition the messages from a topic into various "private consumer
>> topics" (without altering the original) to provide a filtered subscription
>> service (e.g. for a remote service on a slower network that cannot afford
>> to receive the whole stream and only wants a subset of it).
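>>
>> Schematically, that special consumer would look something like this (the
>> Source/Sink types are stand-ins I made up for the Kafka consumer and
>> producer, just to show the shape of the routing):
>>
>>     import java.util.List;
>>     import java.util.function.Predicate;
>>
>>     class FilteredFanOut {
>>         interface Source { byte[] poll(); }                     // stand-in: main-topic consumer
>>         interface Sink { void send(String topic, byte[] msg); } // stand-in: producer
>>
>>         // one route per downstream subscriber: a filter plus its private topic
>>         static class Route {
>>             final Predicate<byte[]> accepts;
>>             final String privateTopic;
>>             Route(Predicate<byte[]> accepts, String privateTopic) {
>>                 this.accepts = accepts;
>>                 this.privateTopic = privateTopic;
>>             }
>>         }
>>
>>         static void run(Source main, Sink out, List<Route> routes) {
>>             while (true) {
>>                 byte[] msg = main.poll();  // read the full stream once, unaltered
>>                 for (Route r : routes)     // copy only matching messages downstream
>>                     if (r.accepts.test(msg)) out.send(r.privateTopic, msg);
>>             }
>>         }
>>     }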
>>
>> Do you think it would make sense to have a remote API call that manually
>> expires some partition segments by offset (as opposed to time and/or size)?
>> For example, exposing cleanupLogs with additional parameters to clean up
>> segments on demand? I think it would be more than enough for me, and it
>> could be used for various other things, like manually truncating a topic
>> whose data isn't relevant anymore without recreating it.
>>
>> Thanks,
>>
>> -----Original Message-----
>> From: Neha Narkhede [mailto:neha.narkh...@gmail.com]
>> Sent: Wednesday, August 27, 2014 11:36 PM
>> To: users@kafka.apache.org
>> Subject: Re: Consumer sensitive expiration of topic
>>
>> Kafka is designed to maintain a persistent backlog of data on disk
>> efficiently and at scale. Unlike in other messaging systems, doing so does
>> not affect the performance of the system. If you are worried about the
>> messages occupying disk space, you can always set a lower retention on the
>> topic, as long as it remains higher than any lag your consumer can accrue.
>> The best approach here is to plan the disk space allocation around that
>> retention.
>>
>>
>> On Mon, Aug 25, 2014 at 2:25 PM, Prunier, Dominique <
>> dominique.prun...@emc.com> wrote:
>>
>>> Any ideas on this use case, guys?
>>>
>>> Thanks,
>>>
>>> -----Original Message-----
>>> From: Prunier, Dominique [mailto:dominique.prun...@emc.com]
>>> Sent: Friday, August 15, 2014 11:02 AM
>>> To: users@kafka.apache.org
>>> Subject: RE: Consumer sensitive expiration of topic
>>>
>>> Hi,
>>>
>>> Thanks for the answer.
>>>
>>> The topics themselves won't be short-lived (as their consumers are supposed
>>> to stay around); the messages in them will. What I'm trying to achieve is
>>> something like this:
>>>
>>> Producers --<topic>--> Processor A0 --<topic_a_1>--> Processor A1 --<topic_a_2>--> ... --<topic_a_N>--> Consumer
>>>                   |--> Processor B0 --<topic_b_1>--> Processor B1 --<topic_b_2>--> ... --<topic_b_N>--> Consumer
>>>                   |--> Processor C0 --<topic_c_1>--> Processor C1 --<topic_c_2>--> ... --<topic_c_N>--> Consumer
>>>
>>> Essentially, the "main" topic is the first one and the only one consumed by
>>> multiple processors/consumers. Each processor knows which processor comes
>>> next by knowing the name of its "private" topic. So in this example, once
>>> Processor A1 picks up a message from topic_a_1 and commits the offset, the
>>> message won't be used by anyone else.
>>>
>>> There is no particular issue with leaving this as is, but topic_a_1 is
>>> going to buffer quite a lot of data on disk, while essentially the only
>>> failure we have to deal with here is Processor A1 going down or lagging.
>>> When Processor A1 is healthy, the expiration of topic_a_1 could be kept
>>> very low, avoiding a fair amount of resource use.
>>>
>>> An idea off the top of my head would be an API where you can manually set
>>> the expiration of a topic by specifying offsets for its partitions. This
>>> way, once Processor A1 has consumed its messages, it could not only commit
>>> the offsets (which, as far as I understand, has nothing to do with the
>>> broker itself) but also set the expiration of the topic using those same
>>> offsets (which could be done less frequently).
>>>
>>> Does that make sense?
>>>
>>> Thanks,
>>>
>>> -----Original Message-----
>>> From: Neha Narkhede [mailto:neha.narkh...@gmail.com]
>>> Sent: Thursday, August 14, 2014 8:10 PM
>>> To: users@kafka.apache.org
>>> Subject: Re: Consumer sensitive expiration of topic
>>>
>>> By design, Kafka stores data independently of the number of publishers or
>>> subscribers connecting to it. This provides high performance, as the broker
>>> does not have to track consumers and evict data based on each consumer's
>>> position. This is one of the main reasons why Kafka is much more performant
>>> than JMS queues.
>>>
>>> It seems like your use case requires the concept of ephemeral topics, where
>>> you would like to auto-delete a topic once a particular consumer group has
>>> finished consuming data from it. Once 0.8.2 is released with delete-topic
>>> support, we intend to add auto-expiration of topics, which will delete
>>> topics that have not been accessed within some configurable time.
>>>
>>> Is there a reason why your application needs to create such short-lived
>>> topics?
>>>
>>> Thanks,
>>> Neha
>>>
>>>
>>> On Thu, Aug 14, 2014 at 2:56 PM, Prunier, Dominique <
>>> dominique.prun...@emc.com> wrote:
>>>
>>> > Hi,
>>> >
>>> > I'm playing around with Kafka with the idea of implementing a
>>> > general-purpose message exchanger for a distributed application with high
>>> > throughput requirements (several hundred thousand messages per second).
>>> >
>>> > In this context, I would like to be able to use a topic as a form of
>>> > private mailbox for a single consumer group. Once that single consumer
>>> > group has committed its offsets on its private topic, the messages there
>>> > won't be used anymore and can safely be discarded. So I was wondering
>>> > whether you'd see a way (in the current release or in the future) to have
>>> > a topic whose expiration policy is based on consumer offsets.
>>> >
>>> > Thanks,
>>> >
>>> > --
>>> > Dominique Prunier
>>> >
>>> >
>>>
