Daniel, I understand your point. From what I understand, the mode that suits you is what Jay suggested: log.retention.ms and log.retention.bytes both set to -1.
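In server.properties that would be just (a sketch of the two settings above):

    # keep data forever: no time-based and no size-based deletion
    log.retention.ms=-1
    log.retention.bytes=-1

If you only want this for selected topics, the per-topic overrides retention.ms and retention.bytes should do the same thing, though I have not checked whether -1 is accepted there as well.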
A few questions before I continue, on something that may already be possible:

1. Is it possible to attach additional storage without having to restart Kafka?
2. If the answer to 1 is yes: will Kafka continue the topic on the new storage
   once all attached disks are full? Or is the assumption that one data_dir =
   one topic/partition (the code suggests so)?
3. If the answer to 1 is no: is it possible to take segments out without having
   to restart Kafka?

Kind regards,
Radek Gruchalski
ra...@gruchalski.com
de.linkedin.com/in/radgruchalski/

On Saturday, 11 July 2015 at 22:22, Daniel Schierbeck wrote:

> Radek: I don't see how data could be stored more efficiently than in Kafka
> itself. It's optimized for cheap storage and offers high-performance bulk
> export, exactly what you want from long-term archival.
>
> On fre. 10. jul. 2015 at 23.16 Rad Gruchalski <ra...@gruchalski.com> wrote:
>
> > Hello all,
> >
> > This is a very interesting discussion. I've been thinking of a similar use
> > case for Kafka over the last few days.
> > The usual data workflow with Kafka is most likely something like this:
> >
> > - ingest with Kafka
> > - process with Storm / Samza / whathaveyou
> > - put some processed data back on Kafka
> > - at the same time store the raw data somewhere in case everything has to
> >   be reprocessed in the future (HDFS, similar?)
> >
> > Currently Kafka offers a couple of types of topics: a regular stream
> > (non-compacted topic) and a compacted topic (key/value). In the case of a
> > stream topic, when the retention cleanup kicks in, the "old" data is
> > truncated; it is lost from Kafka. What if there was an additional cleanup
> > setting: cold-store?
> > Instead of trimming old data, Kafka would compile it into a separate log
> > with its own index. The user would be free to decide what to do with such
> > files: put them on NFS / S3 / Swift / HDFS... Actually, the index file is
> > not needed. The only three things needed are:
> >
> > - the folder name / partition index
> > - the log itself
> > - the topic metadata at the time of taking the data out of the segment
> >
> > With all this info, reading the data back is fairly easy, even without
> > starting Kafka. A sample program goes like this (scala-ish, using Kafka's
> > internal kafka.log classes):
> >
> > import java.io.File
> > import java.util.Properties
> > import kafka.log.{Log, LogConfig}
> >
> > val props = new Properties()
> > props.put("segment.bytes", "1073741824")     // LogConfig uses the topic-level property names
> > props.put("segment.index.bytes", "10485760") // should be 10MB
> > // build the LogConfig that Log expects (LogConfig.fromProps in the 0.8.x API)
> > val cfg = LogConfig.fromProps(props)
> >
> > val log = new Log(
> >   new File("/somestorage/kafka-test-0"),
> >   cfg,
> >   0L,    // recovery point
> >   null ) // no scheduler needed just for reading (sketch only)
> >
> > val fdi = log.activeSegment.read(log.logStartOffset, Some(log.logEndOffset), 1000000)
> > var msgs = 1
> > fdi.messageSet.iterator.foreach { msgoffset =>
> >   println(s" ${msgoffset.message.hasKey} ::: $msgs ::::> ${msgoffset.offset} :::::: ${msgoffset.nextOffset}")
> >   msgs = msgs + 1
> >   val key = new String(msgoffset.message.key.array(), "UTF-8")
> >   val msg = new String(msgoffset.message.payload.array(), "UTF-8")
> >   println(s" === ${key} ")
> >   println(s" === ${msg} ")
> > }
> >
> > This reads from the active segment (the last known segment), but it is easy
> > to make it read from all segments.
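Adding to my own sketch above: to read every segment rather than just the active one, something like this should work. A sketch only, assuming the same internal API also exposes Log.logSegments and LogSegment.read; I have not verified it end to end:

    log.logSegments.foreach { seg =>
      // read the whole segment, starting at its base offset; maxSize here is arbitrary
      val info = seg.read(seg.baseOffset, None, Int.MaxValue)
      if (info != null) { // read returns null when there is nothing at or after the start offset
        info.messageSet.iterator.foreach { msgoffset =>
          println(s"${msgoffset.offset}")
        }
      }
    }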
> > The interesting thing is that, as long as the backup files are well formed,
> > they can be read without having to put them back into Kafka itself.
> >
> > The advantage is that what was once the raw data (as it came in) stays the
> > raw data forever, without having to introduce another format for storing it.
> > Another advantage: in case of reprocessing there is no need to write a
> > producer to ingest the data back, and so on (it is possible, just not
> > necessary). Such raw Kafka files can be easily processed by Storm / Samza
> > (it would need another stream definition) / Hadoop.
> >
> > This sounds like a very useful addition to Kafka. But I could be
> > overthinking this...
> >
> > Kind regards,
> > Radek Gruchalski
> > ra...@gruchalski.com
> > de.linkedin.com/in/radgruchalski/
> >
> > On Friday, 10 July 2015 at 22:55, Daniel Schierbeck wrote:
> >
> > > On 10. jul. 2015, at 15.16, Shayne S <shaynest...@gmail.com> wrote:
> > >
> > > > There are two ways you can configure your topics: log compaction and
> > > > with no cleaning. The choice depends on your use case. Are the records
> > > > uniquely identifiable and will they receive updates? Then log
> > > > compaction is the way to go. If they are truly read-only, you can go
> > > > without log compaction.
> > >
> > > I'd rather be free to use the key for partitioning, and the records are
> > > immutable (they're event records), so disabling compaction altogether
> > > would be preferable. How is that accomplished?
> > >
> > > > We have small processes which consume a topic and perform upserts to
> > > > our various database engines. It's easy to change how it all works and
> > > > simply consume the single source of truth again.
> > > >
> > > > I've written a bit about log compaction here:
> > > > http://www.shayne.me/blog/2015/2015-06-25-everything-about-kafka-part-2/
> > > >
> > > > On Fri, Jul 10, 2015 at 3:46 AM, Daniel Schierbeck
> > > > <daniel.schierb...@gmail.com> wrote:
> > > >
> > > > > I'd like to use Kafka as a persistent store – sort of as an
> > > > > alternative to HDFS. The idea is that I'd load the data into various
> > > > > other systems in order to solve specific needs such as full-text
> > > > > search, analytics, indexing by various attributes, etc. I'd like to
> > > > > keep a single source of truth, however.
> > > > >
> > > > > I'm struggling a bit to understand how I can configure a topic to
> > > > > retain messages indefinitely. I want to make sure that my data isn't
> > > > > deleted. Is there a guide to configuring Kafka like this?
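PS. Daniel, on your question quoted above about disabling compaction: as far as I understand, compaction only runs for topics whose cleanup.policy is set to compact; the default, delete, simply applies the retention limits, so with retention set to -1 nothing should ever be cleaned up. A sketch of a per-topic override (hypothetical topic name "events"; I have not verified that -1 is accepted for the per-topic settings):

    bin/kafka-topics.sh --zookeeper localhost:2181 --alter --topic events \
      --config cleanup.policy=delete \
      --config retention.ms=-1 --config retention.bytes=-1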