gwen, i thought about it a little more and i feel pretty confident i can make it so that it's deterministic in case of node failure. will push that change out after the holidays.
On Mon, Dec 15, 2014 at 12:03 AM, Koert Kuipers <ko...@tresata.com> wrote:
>
> hey gwen,
>
> no immediate plans to contribute it to spark but of course we are open to
> this. given spark's pull request backlog my suspicion is that the spark
> community prefers a user library at this point.
>
> if you lose a node the task will restart. and since each task reads until
> the end of a kafka partition, which is somewhat of a moving target, the
> resulting data will not be the same (it will include whatever was written
> to the partition in the meantime). i am not sure if there is an elegant way
> to implement this where the data would be the same upon a task restart.
>
> if you need the data read to be the same upon retry, this can be done with
> a transformation on the rdd in some cases. for example, if you need data
> exactly up to midnight, you can include a timestamp in the data, start
> the KafkaRDD sometime just after midnight, and then filter to remove any
> data with a timestamp after midnight. now the filtered rdd will be the same
> even if there is a node failure.
>
> On Sun, Dec 14, 2014 at 8:27 PM, Gwen Shapira <gshap...@cloudera.com>
> wrote:
>>
>> Thank you, this is really cool! Are you planning on contributing this to
>> Spark?
>>
>> Another question: What's the behavior if I lose a node while my Spark
>> App is running? Will the RDD recovery process get the exact same data
>> from Kafka as the original? Even if we wrote additional data to Kafka
>> in the meantime?
>>
>> Gwen
>>
>> On Sun, Dec 14, 2014 at 5:22 PM, Koert Kuipers <ko...@tresata.com> wrote:
>> > hello all,
>> > we at tresata wrote a library that provides batch integration between
>> > spark and kafka. it supports:
>> > * distributed write of rdd to kafka
>> > * distributed read of rdd from kafka
>> >
>> > our main use cases are (in lambda architecture speak):
>> > * periodic appends to the immutable master dataset on hdfs from kafka
>> >   using spark
>> > * making non-streaming data available in kafka with periodic data drops
>> >   from hdfs using spark. this is to facilitate merging the speed and
>> >   batch layers in spark-streaming
>> > * distributed writes from spark-streaming
>> >
>> > see here:
>> > https://github.com/tresata/spark-kafka
>> >
>> > best,
>> > koert
>>
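
to make the midnight example above a bit more concrete, here is a rough sketch in scala. it assumes the kafka read has already produced an rdd of (key, value) byte-array pairs (the actual spark-kafka call is not shown), and the extractTimestamp helper and the encoding it assumes are made up purely for illustration:

    import org.apache.spark.rdd.RDD

    object MidnightFilter {
      // hypothetical helper: pull an epoch-millis timestamp out of a record's value bytes.
      // how the timestamp is encoded is application-specific; a leading comma-separated
      // field is assumed here only for the sake of the example.
      def extractTimestamp(value: Array[Byte]): Long =
        new String(value, "UTF-8").split(",")(0).toLong

      // keep only records with a timestamp up to and including the cutoff (e.g. midnight,
      // as epoch millis). because the cutoff is fixed when the job is defined, recomputing
      // the filtered rdd after a task restart yields the same records, even if the
      // underlying kafka read happens to pick up extra data written in the meantime.
      def upTo(cutoffMillis: Long)(records: RDD[(Array[Byte], Array[Byte])]): RDD[(Array[Byte], Array[Byte])] =
        records.filter { case (_, value) => extractTimestamp(value) <= cutoffMillis }
    }

    // usage (kafkaRdd obtained from the spark-kafka library, not shown):
    // val stable = MidnightFilter.upTo(midnightMillis)(kafkaRdd)

the point is just that the cutoff is fixed up front, so a retried task can only ever contribute the same filtered output.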