Hi Lawrence,

Why do you need so much retention? We've generally found that any use of Kafka that wants really long retention (e.g. for compliance or replay reasons) is better served by consuming from the topic and putting the data on S3 (or some other longer-term storage), keeping only a few days of retention in Kafka itself (LinkedIn use 4 days; the Kafka default is 7). https://github.com/pinterest/secor is a good sample project that does this.
Storing long-term data in Kafka is generally a pretty bad idea, because it's really not designed for it. A big part of that is failure handling: if a broker goes down and another broker has to catch up from the replicas, that could mean transferring terabytes across the network. For example, LinkedIn keep about 25-40GB in a partition for 4 days of retention. Multiply that out to a year (roughly 10GB/day x 365) and that's ~3.6TB on each partition. Considering a single failing broker could host many partitions, things will be extremely problematic.

Thanks

Tom Crayford,
Heroku Kafka

On Mon, May 2, 2016 at 8:42 AM, Lawrence Weikum <lwei...@pandora.com> wrote:
> Using 0.9.0.1.
>
> I'm building a new topic that should keep data for much longer than the
> brokers' default, say at least a year, before deleting messages.
> http://kafka.apache.org/documentation.html says setting the "retention.ms"
> for the topic will adjust the time, but I cannot find what unit of time
> Kafka uses for this. "ms" would suggest "milliseconds", so a year would be
> about 3.154e+10 milliseconds. This seems like an uncomfortably high
> number to give.
>
> Can anyone else confirm that the time unit for "retention.ms" in the topic
> config is milliseconds? Is there also a "retention.minutes" that's just
> undocumented?
>
> Thanks!
>
> Lawrence Weikum
>
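For reference, a quick sketch of computing a year in milliseconds and applying it as a topic-level override. retention.ms is indeed in milliseconds; the ZooKeeper address, topic name, and tool path below are assumptions for illustration, not from this thread:

```shell
# One year expressed in milliseconds (365 days; retention.ms takes ms).
MS_PER_YEAR=$(( 365 * 24 * 60 * 60 * 1000 ))
echo "retention.ms for one year: ${MS_PER_YEAR}"

# Applying it as a per-topic override (commented out; ZooKeeper address
# and topic name "my-topic" are placeholder assumptions):
# bin/kafka-configs.sh --zookeeper localhost:2181 --alter \
#   --entity-type topics --entity-name my-topic \
#   --add-config retention.ms=${MS_PER_YEAR}
```

That works out to 31,536,000,000 ms, i.e. ~3.154e+10, which is the "uncomfortably high" number in question.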