hey gwen, no immediate plans to contribute it to spark, but of course we are open to it. given spark's pull request backlog my suspicion is that the spark community prefers a user library at this point.
if you lose a node the task will restart. and since each task reads until the end of a kafka partition, which is somewhat of a moving target, the resulting data will not be the same: it will include whatever was written to the partition in the meantime. i am not sure there is an elegant way to implement this so that the data would be the same upon a task restart.

if you need the data read to be the same upon retry, this can be done with a transformation on the rdd in some cases. for example, if you need data exactly up to midnight, you can include a timestamp in the data, start the KafkaRDD sometime just after midnight, and then filter out any data with a timestamp after midnight. the filtered rdd will now be the same even if there is a node failure. see the sketch at the bottom of this mail, below the quoted thread.

On Sun, Dec 14, 2014 at 8:27 PM, Gwen Shapira <gshap...@cloudera.com> wrote:
>
> Thank you, this is really cool! Are you planning on contributing this to
> Spark?
>
> Another question: What's the behavior if I lose a node while my Spark
> App is running? Will the RDD recovery process get the exact same data
> from Kafka as the original? Even if we wrote additional data to Kafka
> in the meantime?
>
> Gwen
>
> On Sun, Dec 14, 2014 at 5:22 PM, Koert Kuipers <ko...@tresata.com> wrote:
> > hello all,
> > we at tresata wrote a library to provide for batch integration between
> > spark and kafka. it supports:
> > * distributed write of rdd to kafka
> > * distributed read of rdd from kafka
> >
> > our main use cases are (in lambda architecture speak):
> > * periodic appends to the immutable master dataset on hdfs from kafka
> >   using spark
> > * make non-streaming data available in kafka with periodic data drops
> >   from hdfs using spark. this is to facilitate merging the speed and
> >   batch layers in spark-streaming
> > * distributed writes from spark-streaming
> >
> > see here:
> > https://github.com/tresata/spark-kafka
> >
> > best,
> > koert
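ps. here is a minimal sketch of the timestamp-filter trick in scala. it assumes a hypothetical Event record type with the event timestamp baked into the data by the producer, and that the messages have already been read into an rdd (the actual spark-kafka api may differ):

    import org.apache.spark.rdd.RDD

    // hypothetical record type: assumes every kafka message carries its
    // own event timestamp, written by the producer
    case class Event(timestamp: Long, payload: String)

    // events: messages read from kafka sometime just after midnight.
    // cutoffMillis: midnight, as epoch millis.
    // the read itself is a moving target (a retry may see newer messages
    // appended in the meantime), but filtering on the in-record timestamp
    // makes the resulting rdd the same even upon a task restart.
    def upToCutoff(events: RDD[Event], cutoffMillis: Long): RDD[Event] =
      events.filter(_.timestamp < cutoffMillis)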