I like the idea of the KafkaRDD and one Spark partition/split per Kafka partition. That is a good use of the SimpleConsumer.
I can see a few different strategies for the commitOffsets and partitionOwnership. What use case are you committing your offsets for?

/*******************************************
 Joe Stein
 Founder, Principal Consultant
 Big Data Open Source Security LLC
 http://www.stealth.ly
 Twitter: @allthingshadoop <http://www.twitter.com/allthingshadoop>
********************************************/

On Sun, Dec 14, 2014 at 8:22 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
> hello all,
> we at tresata wrote a library to provide for batch integration between
> spark and kafka. it supports:
> * distributed write of rdd to kafka
> * distributed read of rdd from kafka
>
> our main use cases are (in lambda architecture speak):
> * periodic appends to the immutable master dataset on hdfs from kafka
> using spark
> * make non-streaming data available in kafka with periodic data drops
> from hdfs using spark. this is to facilitate merging the speed and
> batch layers in spark-streaming
> * distributed writes from spark-streaming
>
> see here:
> https://github.com/tresata/spark-kafka
>
> best,
> koert
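For anyone following along, here is a minimal sketch of the partition-per-Kafka-partition layout being discussed: the driver plans one split per Kafka partition (each carrying an offset window), and each task drains its window sequentially, the way one SimpleConsumer per partition leader would. This is illustrative only — the names `OffsetRange`, `plan_splits`, and `read_split` are hypothetical and not the spark-kafka library's actual API, and the broker is faked with a plain function.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OffsetRange:
    """One hypothetical RDD split: a single Kafka topic-partition
    plus the offset window it is responsible for."""
    topic: str
    partition: int
    from_offset: int
    until_offset: int  # exclusive

def plan_splits(topic, partition_offsets):
    """Driver-side planning: build exactly one split per Kafka partition.
    partition_offsets maps partition id -> (earliest, latest) offsets,
    as a metadata lookup against the leaders would return them."""
    return [
        OffsetRange(topic, p, lo, hi)
        for p, (lo, hi) in sorted(partition_offsets.items())
    ]

def read_split(split, fetch):
    """Task-side read: drain one split's offset window in order,
    like one task driving one consumer against the partition leader.
    `fetch` stands in for the broker fetch call."""
    return [
        fetch(split.topic, split.partition, offset)
        for offset in range(split.from_offset, split.until_offset)
    ]

# Toy "broker": the message payload is derived from its coordinates.
fetch = lambda topic, partition, offset: f"{topic}-{partition}-{offset}"

splits = plan_splits("events", {0: (0, 3), 1: (5, 7)})
for s in splits:
    print(s.partition, read_split(s, fetch))
```

Because every split owns a disjoint offset window on a distinct partition, the reads need no coordination between tasks, which is what makes the SimpleConsumer (rather than the high-level consumer with its ZooKeeper-based group rebalancing) a natural fit here.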