Hi,

As I understand it, your problem is similar to this JIRA:
https://issues.apache.org/jira/browse/SPARK-1647

The issue in this case is that Kafka cannot replay the messages because the offsets have already been committed. I think the existing KafkaUtils (the default high-level Kafka consumer) has this issue as well. There is a similar discussion in this thread:
http://apache-spark-user-list.1001560.n3.nabble.com/Data-loss-Spark-streaming-and-network-receiver-td12337.html

I think it is possible to tackle this in the consumer code I have written. If we store the topic, partition_id, and consumed offset in ZK after every checkpoint, then after Spark recovers from the failover the present PartitionManager code can start reading from the last checkpointed offset (instead of the last committed offset, as it does now). In that case it can replay the data since the last checkpoint. I will think it over. A rough sketch of the idea is below, after the quoted message.

Regards,
Dibyendu

On Mon, Aug 25, 2014 at 11:23 PM, RodrigoB <rodrigo.boav...@aspect.com> wrote:

> Hi Dibyendu,
>
> My colleague has taken a look at the Spark Kafka consumer GitHub project
> you provided and started experimenting.
>
> We found that when Spark has a failure after a data checkpoint, the
> expected re-computations corresponding to the metadata checkpoints are not
> recovered, so we lose Kafka messages and RDD computations in Spark.
> The impression is that this code replaces quite a bit of the Spark Kafka
> Streaming code, where maybe (not sure) metadata checkpoints are done every
> batch interval.
>
> Was it intentional to depend solely on the Kafka commit to recover data and
> recomputations between data checkpoints? If so, how do we make this work?
>
> tnks
> Rod
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Low-Level-Kafka-Consumer-for-Spark-tp11258p12757.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
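
Something along these lines (a minimal sketch only: the ZooKeeper paths, class name and method names are illustrative, not the actual PartitionManager code; it just shows saving the checkpointed offset to ZK and reading it back on recovery):

import org.apache.zookeeper.{CreateMode, WatchedEvent, Watcher, ZooKeeper}
import org.apache.zookeeper.ZooDefs.Ids

// Hypothetical helper: persists the last *checkpointed* offset for a
// topic/partition to ZooKeeper, and reads it back after a failover.
class CheckpointedOffsetStore(zkConnect: String) {

  private val zk = new ZooKeeper(zkConnect, 30000, new Watcher {
    override def process(event: WatchedEvent): Unit = ()
  })

  // Illustrative ZK path layout, one node per topic/partition.
  private def path(topic: String, partition: Int): String =
    s"/consumers/spark-lowlevel/$topic/$partition/checkpointed_offset"

  // Called after every Spark checkpoint completes, in addition to the
  // normal Kafka offset commit.
  def saveCheckpointedOffset(topic: String, partition: Int, offset: Long): Unit = {
    val p = path(topic, partition)
    if (zk.exists(p, false) == null) createRecursive(p)
    zk.setData(p, offset.toString.getBytes("UTF-8"), -1) // -1 = ignore version
  }

  // Called on recovery: start from the last checkpointed offset rather than
  // the last committed offset, so data since the last checkpoint is replayed.
  def readCheckpointedOffset(topic: String, partition: Int): Option[Long] = {
    val p = path(topic, partition)
    if (zk.exists(p, false) == null) None
    else Some(new String(zk.getData(p, false, null), "UTF-8").toLong)
  }

  // Create the znode and any missing parents.
  private def createRecursive(fullPath: String): Unit = {
    fullPath.split("/").filter(_.nonEmpty).foldLeft("") { (parent, node) =>
      val current = s"$parent/$node"
      if (zk.exists(current, false) == null)
        zk.create(current, Array.emptyByteArray, Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT)
      current
    }
  }
}

The PartitionManager would call saveCheckpointedOffset from the checkpoint hook and readCheckpointedOffset when it restarts after a failover, falling back to the committed offset if no checkpointed offset exists yet.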