Daniel, I understand your point. From what I understand, the mode that suits you is what Jay suggested: log.retention.ms and log.retention.bytes both set to -1.
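In server.properties that would be just (a sketch of the two settings above):

    # keep data forever: no time-based and no size-based deletion
    log.retention.ms=-1
    log.retention.bytes=-1

If you only want this for selected topics, the per-topic overrides retention.ms and retention.bytes should do the same thing, though I have not checked whether -1 is accepted there as well.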
A few questions before I continue, on something that may already be possible:

1. Is it possible to attach additional storage without having to restart Kafka?
2. If the answer to 1 is yes: will Kafka continue the topic on the new storage
   once all attached disks are full? Or is the assumption that one data_dir =
   one topic/partition (the code suggests so)?
3. If the answer to 1 is no: is it possible to take segments out without having
   to restart Kafka?

Kind regards,
Radek Gruchalski
ra...@gruchalski.com
de.linkedin.com/in/radgruchalski/

On Saturday, 11 July 2015 at 22:22, Daniel Schierbeck wrote:

> Radek: I don't see how data could be stored more efficiently than in Kafka
> itself. It's optimized for cheap storage and offers high-performance bulk
> export, exactly what you want from long-term archival.
>
> On fre. 10. jul. 2015 at 23.16 Rad Gruchalski <ra...@gruchalski.com> wrote:
>
> > Hello all,
> >
> > This is a very interesting discussion. I've been thinking of a similar use
> > case for Kafka over the last few days.
> > The usual data workflow with Kafka is most likely something like this:
> >
> > - ingest with Kafka
> > - process with Storm / Samza / whathaveyou
> > - put some processed data back on Kafka
> > - at the same time store the raw data somewhere in case everything has to
> >   be reprocessed in the future (HDFS, similar?)
> >
> > Currently Kafka offers a couple of types of topics: a regular stream
> > (non-compacted topic) and a compacted topic (key/value). In the case of a
> > stream topic, when the retention cleanup kicks in, the "old" data is
> > truncated; it is lost from Kafka. What if there was an additional cleanup
> > setting: cold-store?
> > Instead of trimming old data, Kafka would compile it into a separate log
> > with its own index. The user would be free to decide what to do with such
> > files: put them on NFS / S3 / Swift / HDFS... Actually, the index file is
> > not needed. The only three things needed are:
> >
> > - the folder name / partition index
> > - the log itself
> > - the topic metadata at the time of taking the data out of the segment
> >
> > With all this info, reading the data back is fairly easy, even without
> > starting Kafka. A sample program goes like this (scala-ish, using Kafka's
> > internal kafka.log classes):
> >
> > import java.io.File
> > import java.util.Properties
> > import kafka.log.{Log, LogConfig}
> >
> > val props = new Properties()
> > props.put("segment.bytes", "1073741824")     // LogConfig uses the topic-level property names
> > props.put("segment.index.bytes", "10485760") // should be 10MB
> > // build the LogConfig that Log expects (LogConfig.fromProps in the 0.8.x API)
> > val cfg = LogConfig.fromProps(props)
> >
> > val log = new Log(
> >   new File("/somestorage/kafka-test-0"),
> >   cfg,
> >   0L,    // recovery point
> >   null ) // no scheduler needed just for reading (sketch only)
> >
> > val fdi = log.activeSegment.read(log.logStartOffset, Some(log.logEndOffset), 1000000)
> > var msgs = 1
> > fdi.messageSet.iterator.foreach { msgoffset =>
> >   println(s" ${msgoffset.message.hasKey} ::: $msgs ::::> ${msgoffset.offset} :::::: ${msgoffset.nextOffset}")
> >   msgs = msgs + 1
> >   val key = new String(msgoffset.message.key.array(), "UTF-8")
> >   val msg = new String(msgoffset.message.payload.array(), "UTF-8")
> >   println(s" === ${key} ")
> >   println(s" === ${msg} ")
> > }
> >
> > This reads from the active segment (the last known segment), but it is easy
> > to make it read from all segments.
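Adding to my own sketch above: to read every segment rather than just the active one, something like this should work. A sketch only, assuming the same internal API also exposes Log.logSegments and LogSegment.read; I have not verified it end to end:

    log.logSegments.foreach { seg =>
      // read the whole segment, starting at its base offset; maxSize here is arbitrary
      val info = seg.read(seg.baseOffset, None, Int.MaxValue)
      if (info != null) { // read returns null when there is nothing at or after the start offset
        info.messageSet.iterator.foreach { msgoffset =>
          println(s"${msgoffset.offset}")
        }
      }
    }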
> > The interesting thing is that, as long as the backup files are well formed,
> > they can be read without having to put them back into Kafka itself.
> >
> > The advantage is that what was once the raw data (as it came in) stays the
> > raw data forever, without having to introduce another format for storing it.
> > Another advantage: in case of reprocessing there is no need to write a
> > producer to ingest the data back, and so on (it is possible, just not
> > necessary). Such raw Kafka files can be easily processed by Storm / Samza
> > (it would need another stream definition) / Hadoop.
> >
> > This sounds like a very useful addition to Kafka. But I could be
> > overthinking this...
> >
> > Kind regards,
> > Radek Gruchalski
> > ra...@gruchalski.com
> > de.linkedin.com/in/radgruchalski/
> >
> > On Friday, 10 July 2015 at 22:55, Daniel Schierbeck wrote:
> >
> > > On 10. jul. 2015, at 15.16, Shayne S <shaynest...@gmail.com> wrote:
> > >
> > > > There are two ways you can configure your topics: log compaction and
> > > > with no cleaning. The choice depends on your use case. Are the records
> > > > uniquely identifiable and will they receive updates? Then log
> > > > compaction is the way to go. If they are truly read-only, you can go
> > > > without log compaction.
> > >
> > > I'd rather be free to use the key for partitioning, and the records are
> > > immutable (they're event records), so disabling compaction altogether
> > > would be preferable. How is that accomplished?
> > >
> > > > We have small processes which consume a topic and perform upserts to
> > > > our various database engines. It's easy to change how it all works and
> > > > simply consume the single source of truth again.
> > > >
> > > > I've written a bit about log compaction here:
> > > > http://www.shayne.me/blog/2015/2015-06-25-everything-about-kafka-part-2/
> > > >
> > > > On Fri, Jul 10, 2015 at 3:46 AM, Daniel Schierbeck
> > > > <daniel.schierb...@gmail.com> wrote:
> > > >
> > > > > I'd like to use Kafka as a persistent store – sort of as an
> > > > > alternative to HDFS. The idea is that I'd load the data into various
> > > > > other systems in order to solve specific needs such as full-text
> > > > > search, analytics, indexing by various attributes, etc. I'd like to
> > > > > keep a single source of truth, however.
> > > > >
> > > > > I'm struggling a bit to understand how I can configure a topic to
> > > > > retain messages indefinitely. I want to make sure that my data isn't
> > > > > deleted. Is there a guide to configuring Kafka like this?
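PS. Daniel, on your question quoted above about disabling compaction: as far as I understand, compaction only runs for topics whose cleanup.policy is set to compact; the default, delete, simply applies the retention limits, so with retention set to -1 nothing should ever be cleaned up. A sketch of a per-topic override (hypothetical topic name "events"; I have not verified that -1 is accepted for the per-topic settings):

    bin/kafka-topics.sh --zookeeper localhost:2181 --alter --topic events \
      --config cleanup.policy=delete \
      --config retention.ms=-1 --config retention.bytes=-1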