Apologies for asking another question as a newbie without having really tried
things out, but one of our main reasons for wanting to use Kafka (not the
LinkedIn use case) is precisely that the "buffer" is not just for buffering.
We want to keep data for days to weeks and be able to add ad-hoc consumers
after the fact (obviously we could do that from downstream systems in HDFS).
For example, say we have N machines gathering approximate runtime statistics
for use in real time by live web applications; it is easy for them to listen
to the stream destined for HDFS and keep such stats. If we have to add a new
machine, or one dies, it makes sense to reuse the same code and simply have it
replay the last H hours of events to get back up to speed.
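
To make that catch-up step concrete, something like the sketch below is what
we have in mind. It uses the newer Java consumer's offsetsForTimes call (older
clients would need getOffsetsBefore on the SimpleConsumer instead), and the
broker address, topic name and group id are just placeholders:

    // Sketch only: catch a replacement stats node up by replaying the last
    // H hours of the HDFS-bound stream, then continue with the live feed.
    import org.apache.kafka.clients.consumer.*;
    import org.apache.kafka.common.PartitionInfo;
    import org.apache.kafka.common.TopicPartition;
    import java.time.Duration;
    import java.util.*;

    public class ReplayLastHours {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092");   // placeholder
            props.put("group.id", "stats-node");               // placeholder
            props.put("key.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
            props.put("value.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");

            long hours = 6;                                     // H
            long startTs = System.currentTimeMillis() - hours * 3600_000L;

            try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
                // Assign every partition of the topic explicitly.
                List<TopicPartition> parts = new ArrayList<>();
                for (PartitionInfo p : consumer.partitionsFor("hdfs-events")) {
                    parts.add(new TopicPartition(p.topic(), p.partition()));
                }
                consumer.assign(parts);

                // Ask the brokers for the earliest offset at or after startTs,
                // then seek each partition there.
                Map<TopicPartition, Long> query = new HashMap<>();
                for (TopicPartition tp : parts) {
                    query.put(tp, startTs);
                }
                for (Map.Entry<TopicPartition, OffsetAndTimestamp> e :
                        consumer.offsetsForTimes(query).entrySet()) {
                    if (e.getValue() != null) {
                        consumer.seek(e.getKey(), e.getValue().offset());
                    }
                }

                // Normal consumption from here: the first records replayed are
                // the back-window, after which we keep up with the live stream.
                while (true) {
                    for (ConsumerRecord<byte[], byte[]> r : consumer.poll(Duration.ofMillis(500))) {
                        // update in-memory stats here
                    }
                }
            }
        }
    }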

So I'm curious whether, as this thread suggests, there are problems with
keeping days to weeks of data around and accessing it.

Note also that we are considering using Kafka for (continuous/on-demand)
high-performance instrumentation, in which case we may not have any consumers
at all until we need them (we would want a back-window so that we can produce
debug logs from the event stream after the fact, or replay events into other
systems), though the real-time feed may also be used for alerting and
Graphite. We might eventually allow ad-hoc queries against data in the event
stream, which may require us to turn event generation on/off in the producers;
either way we would want to efficiently filter the Kafka event stream based on
arbitrary data - something that can't be done with topics today (even with the
suggested hierarchical topics). If we do it right, we can use a
schema/producer registry to figure out the small subset of topics that might
contain the data we need, then use the schema registry to pick the Avro schema
needed to efficiently filter that subset of topics on any arbitrary set of
attributes in the data.
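
As a rough illustration of that filtering idea, the sketch below decodes each
raw message with the writer schema looked up per topic and matches arbitrary
attribute/value pairs. The SchemaRegistry interface is hypothetical - it just
stands in for whatever registry we end up building - while the Avro calls
themselves are the standard ones:

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.DecoderFactory;
    import java.io.IOException;
    import java.util.Map;
    import java.util.Objects;

    public class AvroAttributeFilter {

        /** Hypothetical lookup: topic -> the Avro writer schema for that topic. */
        public interface SchemaRegistry {
            Schema schemaFor(String topic);
        }

        private final SchemaRegistry registry;

        public AvroAttributeFilter(SchemaRegistry registry) {
            this.registry = registry;
        }

        /** True if the Avro-encoded payload carries all the wanted attribute values. */
        public boolean matches(String topic, byte[] payload,
                               Map<String, Object> wanted) throws IOException {
            Schema schema = registry.schemaFor(topic);
            GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
            GenericRecord record =
                reader.read(null, DecoderFactory.get().binaryDecoder(payload, null));

            for (Map.Entry<String, Object> e : wanted.entrySet()) {
                // Topics whose schema lacks the field can't match at all.
                if (schema.getField(e.getKey()) == null) {
                    return false;
                }
                Object actual = record.get(e.getKey());
                Object expected = e.getValue();
                // Avro decodes strings as Utf8, so compare string values via toString().
                if (actual instanceof CharSequence && expected instanceof CharSequence) {
                    if (!actual.toString().equals(expected.toString())) {
                        return false;
                    }
                } else if (!Objects.equals(actual, expected)) {
                    return false;
                }
            }
            return true;
        }
    }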

If the latter sounds useful to anyone then we'll of course contribute back.
I'm also curious about the current state of camel etc., since we were already
considering building something similar, but it seems like it isn't currently
(as of the recent open source) ZooKeeper-based, which seems odd. We are also
certainly considering allowing for more dynamic registration, where the
registered value isn't just a schema but a schema plus other contextual
information common to all events from a producer (e.g. source machine,
application, app version, etc.).


On Feb 21, 2013, at 7:26 PM, Jay Kreps <jay.kr...@gmail.com> wrote:

> You can do this and it should work fine. You would have to keep adding
> machines to get disk capacity, of course, since your data set would
> only grow.
> 
> We will keep an open file descriptor per file, but I think that is
> okay. Just set the segment size to 1GB, then with 10TB of storage that
> is only 10k files which should be fine. Adjust the OS open FD limit up
> a bit if needed. File descriptors don't use too much memory so this
> should not hurt anything.
> 
> -Jay
> 
> On Thu, Feb 21, 2013 at 4:00 PM, Anthony Grimes <i...@raynes.me> wrote:
>> Our use case is that we'd like to log away data we don't need right now and
>> potentially replay it at some point. We don't want to delete old logs. I
>> googled around a bit and I only discovered this particular post:
>> http://mail-archives.apache.org/mod_mbox/incubator-kafka-users/201210.mbox/%3CCAFbh0Q2=eJcDT6NvTAPtxhXSk64x0Yms-G-AOqOoy=ftvvm...@mail.gmail.com%3E
>> 
>> In summary, it appears the primary issue is that Kafka keeps file handles of
>> each log segment open. Is there a way to configure this, or is a way to do
>> so planned? It appears that an option to deduplicate instead of delete was
>> added recently, so doesn't the file handle issue exist with that as well
>> (since files aren't being deleted)?
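
For anyone following along, the settings Jay describes above would look
roughly like the sketch below in the broker config. The property names are
taken from the 0.8 broker configuration, so double-check them against the
version you actually run:

    # server.properties (sketch)

    # 1GB segments: with 10TB of storage that is roughly 10k segment files,
    # i.e. roughly 10k open file descriptors for the log.
    log.segment.bytes=1073741824

    # Keep data for ~30 days instead of the default 7.
    log.retention.hours=720

    # And raise the broker process's open-file limit if needed, e.g.
    #   ulimit -n 100000
    # before starting the broker.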
