Radek: I don't see how data could be stored more efficiently than in Kafka itself. It's optimized for cheap storage and offers high-performance bulk export, exactly what you want from long-term archival.

On fre. 10. jul. 2015 at 23.16 Rad Gruchalski <ra...@gruchalski.com> wrote:
> Hello all,
>
> This is a very interesting discussion. I've been thinking of a similar use
> case for Kafka over the last few days.
> The usual data workflow with Kafka is most likely something like this:
>
> - ingest with Kafka
> - process with Storm / Samza / what have you
> - put some processed data back on Kafka
> - at the same time, store the raw data somewhere in case everything has to
>   be reprocessed in the future (HDFS or similar?)
>
> Currently Kafka offers a couple of types of topics: a regular stream
> (non-compacted) topic and a compacted (key/value) topic. In the case of a
> stream topic, when the cleanup kicks in, the "old" data is truncated; it is
> lost from Kafka. What if there was an additional cleanup policy: cold-store?
> Instead of trimming old data, Kafka would compile old data into a separate
> log with its own index. The user would be free to decide what to do with
> such files: put them on NFS / S3 / Swift / HDFS... Actually, the index file
> is not needed. The only 3 things needed are:
>
> - the folder name / partition index
> - the log itself
> - topic metadata at the time of taking the data out of the segment
>
> With all this info, reading the data back is fairly easy, even without
> starting Kafka. A sample program goes like this (scala-ish):
>
> val props = new Properties()
> props.put("log.segment.bytes", "1073741824")
> props.put("segment.index.bytes", "10485760") // should be 10MB
>
> // cfg is a kafka.log.LogConfig built from the properties above
> val log = new Log(
>   new File("/somestorage/kafka-test-0"),
>   cfg,
>   0L,
>   null )
>
> val fdi = log.activeSegment.read( log.logStartOffset,
>   Some(log.logEndOffset), 1000000 )
> var msgs = 1
> fdi.messageSet.iterator.foreach { msgoffset =>
>   println( s" ${msgoffset.message.hasKey} ::: $msgs ::::> ${msgoffset.offset} :::::: ${msgoffset.nextOffset}" )
>   msgs = msgs + 1
>   val key = new String( msgoffset.message.key.array(), "UTF-8" )
>   val msg = new String( msgoffset.message.payload.array(), "UTF-8" )
>   println( s" === ${key} " )
>   println( s" === ${msg} " )
> }
>
> This reads from the active segment (the last known segment) but it's easy
> to make it read from all segments. The interesting thing is: as long as the
> backup files are well formed, they can be read without having to put them
> back into Kafka itself.
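> Something along these lines (an untested sketch, using the same internal
> kafka.log classes as above) would walk every segment instead of only the
> active one:
>
> log.logSegments.foreach { segment =>
>   // segment.log is the FileMessageSet backing the .log file on disk
>   segment.log.iterator.foreach { msgoffset =>
>     val key = if (msgoffset.message.hasKey)
>       new String( msgoffset.message.key.array(), "UTF-8" ) else "<no key>"
>     val msg = new String( msgoffset.message.payload.array(), "UTF-8" )
>     println( s"${msgoffset.offset} : ${key} -> ${msg}" )
>   }
> }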
>
> The advantage is: what was once the raw data (as it came in) stays the raw
> data forever, without having to introduce another format for storing it.
> Another advantage: in case of reprocessing, there is no need to write a
> producer to ingest the data back, and so on (it's possible but not
> necessary). Such raw Kafka files can be easily processed by Storm / Samza
> (would need another stream definition) / Hadoop.
>
> This sounds like a very useful addition to Kafka. But I could be
> overthinking this...
>
> Kind regards,
> Radek Gruchalski
> ra...@gruchalski.com
> de.linkedin.com/in/radgruchalski/
>
> Confidentiality:
> This communication is intended for the above-named person and may be
> confidential and/or legally privileged. If it has come to you in error you
> must take no action based on it, nor must you copy or show it to anyone;
> please delete/destroy and inform the sender immediately.
>
> On Friday, 10 July 2015 at 22:55, Daniel Schierbeck wrote:
>
> > On 10. jul. 2015, at 15.16, Shayne S <shaynest...@gmail.com> wrote:
> >
> > > There are two ways you can configure your topics: log compaction, or no
> > > cleaning at all. The choice depends on your use case. Are the records
> > > uniquely identifiable and will they receive updates? Then log compaction
> > > is the way to go. If they are truly read-only, you can go without log
> > > compaction.
> >
> > I'd rather be free to use the key for partitioning, and the records are
> > immutable — they're event records — so disabling compaction altogether
> > would be preferable. How is that accomplished?
> >
> > > We have small processes which consume a topic and perform upserts to our
> > > various database engines. It's easy to change how it all works and
> > > simply consume the single source of truth again.
> > >
> > > I've written a bit about log compaction here:
> > > http://www.shayne.me/blog/2015/2015-06-25-everything-about-kafka-part-2/
> > >
> > > On Fri, Jul 10, 2015 at 3:46 AM, Daniel Schierbeck
> > > <daniel.schierb...@gmail.com> wrote:
> > >
> > > > I'd like to use Kafka as a persistent store – sort of as an
> > > > alternative to HDFS. The idea is that I'd load the data into various
> > > > other systems in order to solve specific needs such as full-text
> > > > search, analytics, indexing by various attributes, etc. I'd like to
> > > > keep a single source of truth, however.
> > > >
> > > > I'm struggling a bit to understand how I can configure a topic to
> > > > retain messages indefinitely. I want to make sure that my data isn't
> > > > deleted. Is there a guide to configuring Kafka like this?
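
For reference, the topic-level settings being discussed come down to
cleanup.policy, retention.ms and retention.bytes. A rough, untested sketch
(assuming the 0.8.x AdminUtils API, and that the broker treats -1 as "no
limit") of creating an append-only topic that is never cleaned up:

import java.util.Properties
import kafka.admin.AdminUtils
import kafka.utils.ZKStringSerializer
import org.I0Itec.zkclient.ZkClient

// hypothetical connection string, topic name and sizing; only the config
// overrides matter here
val zk = new ZkClient("localhost:2181", 30000, 30000, ZKStringSerializer)
val topicConfig = new Properties()
topicConfig.put("cleanup.policy", "delete")  // plain retention, no key-based compaction
topicConfig.put("retention.ms", "-1")        // disable time-based deletion
topicConfig.put("retention.bytes", "-1")     // disable size-based deletion
AdminUtils.createTopic(zk, "events", 8, 3, topicConfig)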