Hi Graham,
This sounds like it should work fine. LinkedIn keeps the majority of
things for 7 days. Performance is linear in data size and we have
validated performance up to many TB of data per machine.
The registry you describe sounds like it could be useful.
You would probably have
Apologies for asking another question as a newbie without having really
tried things out, but one of our main reasons for wanting to use
Kafka (not the LinkedIn use case) is exactly the fact that the "buffer" is
not just for buffering. We want to keep data for days to weeks, and be able
to
Sounds good. Thanks for the input, kind sir!
Jay Kreps wrote:
You can do this and it should work fine. You would have to keep adding
machines to get disk capacity, of course, since your data set would
only grow.
We will keep an open file descriptor per file, but I think that is
okay. Just set the segment size to 1GB, then with 10TB of storage that
is only 10,000 files.
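For concreteness, here is a minimal sketch of that setup using today's
AdminClient API (which postdates this thread); the topic name, partition
and replica counts, and broker address are made-up placeholders, and
retention.ms/retention.bytes of -1 is the modern way to disable deletion:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateLongRetentionTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");

            Map<String, String> configs = new HashMap<>();
            // 1GB segments, as suggested above.
            configs.put("segment.bytes", String.valueOf(1024L * 1024L * 1024L));
            // Never delete on time or size; the data set only grows.
            configs.put("retention.ms", "-1");
            configs.put("retention.bytes", "-1");

            try (AdminClient admin = AdminClient.create(props)) {
                // Hypothetical topic: 8 partitions, replication factor 3.
                NewTopic topic = new NewTopic("long-term-events", 8, (short) 3)
                        .configs(configs);
                admin.createTopics(List.of(topic)).all().get();
            }
        }
    }

At 1GB per segment, 10TB per broker works out to roughly 10,000 segment
files, each held open, so the OS file descriptor limit just needs to be
raised accordingly.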
Forever is a long time. The definition of replay and navigating through
different versions of Kafka would be key.
Example:
If you are storing market data in Kafka and have a CEP engine running on
top, and would like replayed "transactions" to be fed back to ensure
replayability, then you would probably
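For illustration, a rough sketch of that kind of replay with today's Java
consumer (an API that did not exist when this thread was written; the
broker address, the topic name "market-data", and the start time are all
invented):

    import java.time.Duration;
    import java.time.Instant;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.serialization.ByteArrayDeserializer;

    public class MarketDataReplay {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                    ByteArrayDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                    ByteArrayDeserializer.class.getName());
            props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

            try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
                // Assign all partitions of the topic explicitly (no group).
                List<TopicPartition> partitions = new ArrayList<>();
                consumer.partitionsFor("market-data").forEach(pi ->
                        partitions.add(new TopicPartition(pi.topic(), pi.partition())));
                consumer.assign(partitions);

                // Rewind each partition to the first offset at/after the start time.
                long startMs = Instant.parse("2013-01-01T00:00:00Z").toEpochMilli();
                Map<TopicPartition, Long> query = new HashMap<>();
                partitions.forEach(tp -> query.put(tp, startMs));
                for (Map.Entry<TopicPartition, OffsetAndTimestamp> e :
                        consumer.offsetsForTimes(query).entrySet()) {
                    if (e.getValue() != null) {
                        consumer.seek(e.getKey(), e.getValue().offset());
                    }
                }

                // Feed the historical records back into the CEP engine.
                while (true) {
                    ConsumerRecords<byte[], byte[]> records =
                            consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<byte[], byte[]> r : records) {
                        // cepEngine.process(r) would go here.
                    }
                    if (records.isEmpty()) break; // caught up, for this sketch
                }
            }
        }
    }

The point is that with the log retained, replay is just seeking a consumer
back to an older offset; nothing here versions the data itself, which is
the concern raised above.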
Anthony,
Is there a reason you wouldn't want to just push the data into something
built for cheap, long-term storage (like Glacier, S3, or HDFS) and perhaps
"replay" from that instead of from the Kafka brokers? I can't speak for
Jay, Jun, or Neha, but I believe the expected usage of Kafka is essentially
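For illustration, the archiving side of that could be as small as a
consumer that drains the topic into date-partitioned files; a minimal
sketch with today's Java client, writing to local paths as stand-ins for
S3/HDFS uploads (the broker, topic, and group names are invented):

    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;
    import java.time.Duration;
    import java.time.Instant;
    import java.time.LocalDate;
    import java.time.ZoneOffset;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class TopicArchiver {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "archiver");
            props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                    StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                    StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("events"));
                while (true) {
                    ConsumerRecords<String, String> records =
                            consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> r : records) {
                        // One file per topic per UTC day, e.g. archive/events/2013-01-01.log
                        LocalDate day = Instant.ofEpochMilli(r.timestamp())
                                .atZone(ZoneOffset.UTC).toLocalDate();
                        Path out = Paths.get("archive", r.topic(), day + ".log");
                        Files.createDirectories(out.getParent());
                        Files.write(out,
                                (r.value() + "\n").getBytes(StandardCharsets.UTF_8),
                                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
                    }
                    // Commit only after the batch is durably written.
                    consumer.commitSync();
                }
            }
        }
    }

Committing offsets only after the write means a crash re-archives a batch
rather than losing it: duplicates in the archive, but no gaps.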
Our use case is that we'd like to log away data we don't currently need and
potentially replay it at some point. We don't want to delete old logs. I
googled around a bit and only discovered this particular post:
http://mail-archives.apache.org/mod_mbox/incubator-kafka-users/201210.mbox/%3CCAFbh0Q2=eJcDT