Doesn't Spark keep track of the DAG lineage and restart from where it stopped? Does it always have to start from the beginning of the lineage when the job fails?
________________________________
From: Massimiliano Tomassi [max.toma...@gmail.com]
Sent: Monday, October 06, 2014 2:40 PM
To: Jahagirdar, Madhu
Cc: Akhil Das; user
Subject: Re: Dstream Transformations

From the Spark Streaming Programming Guide (http://spark.apache.org/docs/latest/streaming-programming-guide.html#failure-of-a-worker-node):

    ...output operations (like foreachRDD) have at-least once semantics, that is, the transformed data may get written to an external entity more than once in the event of a worker failure.

I think that when a worker fails, the entire graph of transformations/actions will be reapplied to that RDD. This means that, in your case, both storing operations will be executed again. For this reason, in a video I watched on YouTube, they suggest making all output operations idempotent. Unfortunately that is not always possible: e.g. when you are building an analytics system and need to increment counters. This is what I've got so far; does anyone have a different point of view?

On 6 October 2014 08:59, Jahagirdar, Madhu <madhu.jahagir...@philips.com> wrote:

Given that I have multiple worker nodes, when Spark schedules the job again on the worker nodes that are still alive, does it store the data in Elasticsearch and then Flume again, or does it only run the function that stores to Flume?

Regards,
Madhu Jahagirdar

________________________________
From: Akhil Das [ak...@sigmoidanalytics.com]
Sent: Monday, October 06, 2014 1:20 PM
To: Jahagirdar, Madhu
Cc: user
Subject: Re: Dstream Transformations

AFAIK Spark doesn't restart worker nodes itself. You can have multiple worker nodes, and in that case, if one worker node goes down, Spark will try to recompute the lost RDDs on the workers that are still alive.

Thanks
Best Regards

On Sun, Oct 5, 2014 at 5:19 AM, Jahagirdar, Madhu <madhu.jahagir...@philips.com> wrote:

In my Spark Streaming program I use the Kafka utilities to receive data and store it in Elasticsearch and in Flume. Both storing functions are applied to the same DStream. My question: what is the behavior of Spark if, after storing the data in Elasticsearch, the worker node dies before storing it in Flume? Does it restart the worker and then store the data in Elasticsearch and then Flume again, or does it only run the function that stores to Flume?

Regards,
Madhu Jahagirdar

--
Massimiliano Tomassi
web: http://about.me/maxtomassi
e-mail: max.toma...@gmail.com
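For concreteness, here is a minimal Scala sketch of the scenario discussed in this thread: two output operations applied to the same DStream, with the Elasticsearch write keyed by a deterministic id so that a replayed batch overwrites the same document instead of duplicating it. The saveToElasticsearch/saveToFlume helpers, the socket source, and the tab-separated record format are placeholder assumptions for illustration, not the code from the original question.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object TwoSinksExample {

  // Hypothetical idempotent writer: using the event's own id as the document id
  // means a re-executed batch overwrites the same ES document instead of duplicating it.
  def saveToElasticsearch(id: String, json: String): Unit = {
    // e.g. an HTTP PUT to /index/type/<id>; details depend on your ES client
  }

  // Hypothetical Flume writer; Flume appends events, so a replayed batch may
  // duplicate them unless the downstream consumer deduplicates on the event id.
  def saveToFlume(id: String, json: String): Unit = {
    // details depend on your Flume client
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TwoSinksExample")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // Stand-in for the Kafka receiver in the original question; each line is
    // assumed to be "<id>\t<json>".
    val lines = ssc.socketTextStream("localhost", 9999)
    val events = lines.map { line =>
      val Array(id, json) = line.split("\t", 2)
      (id, json)
    }

    // Two independent output operations on the same DStream. If a worker dies
    // after the first one ran for a batch, the lost partitions are recomputed
    // and BOTH operations run again for them, hence at-least-once delivery.
    events.foreachRDD { rdd =>
      rdd.foreachPartition(_.foreach { case (id, json) => saveToElasticsearch(id, json) })
    }
    events.foreachRDD { rdd =>
      rdd.foreachPartition(_.foreach { case (id, json) => saveToFlume(id, json) })
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

The design point matches what Massimiliano describes: since both foreachRDD calls are re-executed when lost partitions are recomputed, the only general defense is to make each write idempotent (or deduplicate downstream); incrementing counters is the classic case where that is hard.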