hey joe,
glad to hear you think it's a good use case for SimpleConsumer.

i'm not sure i understand your question. are you asking why we make the
offsets available in the rdd? we have a daily partitioned dataset on hdfs,
and processes that run at night that do the following: 1) read the last
daily partition on hdfs and find the max offset for each kafka partition.
2) use this info to create a new KafkaRDD that resumes reading each kafka
partition where we left off, and write to hdfs to create a new daily
partition. this is wasteful in that we re-read the entire previous daily
partition on hdfs just to find the offsets, but it makes the design very
simple and robust. alternatively we could store this kafka offset info
somewhere (it's available as the nextOffsets accumulator on KafkaRDD) and
avoid re-reading the previous partition. i haven't thought about this
much... perhaps in kafka itself?
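
to make that concrete, here's a rough sketch of the nightly job in scala.
note the KafkaRDD(...) factory, the (partition, offset, payload) record
layout and the paths below are just stand-ins for illustration, not the
library's actual api (see the github repo for the real thing):

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // pairRDD implicits (spark 1.x)

// step 1: re-read yesterday's daily partition and recover the max offset
// seen per kafka partition. assumes (hypothetically) each record was
// written out as a (kafka partition, offset, payload) triple.
def maxOffsets(sc: SparkContext, path: String): Map[Int, Long] =
  sc.objectFile[(Int, Long, Array[Byte])](path)
    .map { case (partition, offset, _) => (partition, offset) }
    .reduceByKey(_ max _)
    .collectAsMap().toMap

// step 2: resume each kafka partition just past where we left off and
// write the result out as today's daily partition.
val sc = new SparkContext("local[*]", "nightly-append")
val startOffsets = maxOffsets(sc, "hdfs:///data/mytopic/2014-12-13").mapValues(_ + 1)
val rdd = KafkaRDD(sc, "mytopic", startOffsets, "broker1:9092") // stand-in factory
rdd.saveAsObjectFile("hdfs:///data/mytopic/2014-12-14")

and if we did store the offsets instead of re-reading, it could be as small
as dropping the nextOffsets map next to the data, something like:

// hypothetical: persist nextOffsets (a Map[Int, Long] per the accumulator
// on KafkaRDD) alongside the daily partition so the next run skips the re-read
sc.parallelize(rdd.nextOffsets.toSeq, 1)
  .saveAsObjectFile("hdfs:///data/mytopic/2014-12-14/_offsets")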


On Sun, Dec 14, 2014 at 9:56 PM, Joe Stein <joe.st...@stealth.ly> wrote:
>
> I like the idea of the KafkaRDD and Spark partition/split per Kafka
> partition. That is a good use of the SimpleConsumer.
>
> I can see a few different strategies for the commitOffsets and
> partitionOwnership.
>
> What use case are you committing your offsets for?
>
> /*******************************************
>  Joe Stein
>  Founder, Principal Consultant
>  Big Data Open Source Security LLC
>  http://www.stealth.ly
>  Twitter: @allthingshadoop <http://www.twitter.com/allthingshadoop>
> ********************************************/
>
> On Sun, Dec 14, 2014 at 8:22 PM, Koert Kuipers <ko...@tresata.com> wrote:
> >
> > hello all,
> > we at tresata wrote a library to provide batch integration between
> > spark and kafka. it supports:
> > * distributed write of rdd to kafka
> > * distributed read of rdd from kafka
> >
> > our main use cases are (in lambda architecture speak):
> > * periodic appends to the immutable master dataset on hdfs from kafka
> > using spark
> > * make non-streaming data available in kafka with periodic data drops
> > from hdfs using spark. this is to facilitate merging the speed and batch
> > layers in spark-streaming
> > * distributed writes from spark-streaming
> >
> > see here:
> > https://github.com/tresata/spark-kafka
> >
> > best,
> > koert
> >
>
