Doesn't Spark keep track of the DAG lineage and restart from where it stopped? Does it always have to start from the beginning of the lineage when the job fails?
________________________________
From: Massimiliano Tomassi [max.toma...@gmail.com]
Sent: Monday, October 06, 2014 2:40 PM
To: Jahagirdar, Madhu
Cc: Akhil Das; user
Subject: Re: Dstream Transformations

From the Spark Streaming Programming Guide (http://spark.apache.org/docs/latest/streaming-programming-guide.html#failure-of-a-worker-node):

    ...output operations (like foreachRDD) have at-least once semantics, that is, the transformed data may get written to an external entity more than once in the event of a worker failure.

I think that when a worker fails, the entire graph of transformations/actions will be reapplied to that RDD. This means that, in your case, both storing operations will be executed again. For this reason, in a video I watched on YouTube, they suggest making all output operations idempotent. Unfortunately that is not always possible: e.g. when you are building an analytics system and need to increment counters. This is what I've got so far; does anyone have a different point of view?

On 6 October 2014 08:59, Jahagirdar, Madhu <madhu.jahagir...@philips.com> wrote:

Given that I have multiple worker nodes, when Spark schedules the job again on the worker nodes that are still alive, does it store the data in Elasticsearch and then Flume again, or does it only run the function that stores to Flume?

Regards,
Madhu Jahagirdar

________________________________
From: Akhil Das [ak...@sigmoidanalytics.com]
Sent: Monday, October 06, 2014 1:20 PM
To: Jahagirdar, Madhu
Cc: user
Subject: Re: Dstream Transformations

AFAIK Spark doesn't restart worker nodes itself. You can have multiple worker nodes, and in that case, if one worker node goes down, Spark will try to recompute the lost RDDs on the workers that are still alive.

Thanks
Best Regards

On Sun, Oct 5, 2014 at 5:19 AM, Jahagirdar, Madhu <madhu.jahagir...@philips.com> wrote:

In my Spark Streaming program I use the Kafka utilities to receive data and store it in Elasticsearch and in Flume. Both storing functions are applied to the same DStream. My question: what is the behavior of Spark if, after storing the data in Elasticsearch, the worker node dies before storing it in Flume? Does it restart the worker and then store the data in Elasticsearch and then Flume again, or does it only run the function that stores to Flume?

Regards,
Madhu Jahagirdar

--
Massimiliano Tomassi
web: http://about.me/maxtomassi
e-mail: max.toma...@gmail.com
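For concreteness, here is a minimal Scala sketch of the scenario discussed in this thread: two output operations applied to the same DStream, with the Elasticsearch write keyed by a deterministic id so that a replayed batch overwrites the same document instead of duplicating it. The saveToElasticsearch/saveToFlume helpers, the socket source, and the tab-separated record format are placeholder assumptions for illustration, not the code from the original question.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object TwoSinksExample {

  // Hypothetical idempotent writer: using the event's own id as the document id
  // means a re-executed batch overwrites the same ES document instead of duplicating it.
  def saveToElasticsearch(id: String, json: String): Unit = {
    // e.g. an HTTP PUT to /index/type/<id>; details depend on your ES client
  }

  // Hypothetical Flume writer; Flume appends events, so a replayed batch may
  // duplicate them unless the downstream consumer deduplicates on the event id.
  def saveToFlume(id: String, json: String): Unit = {
    // details depend on your Flume client
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TwoSinksExample")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // Stand-in for the Kafka receiver in the original question; each line is
    // assumed to be "<id>\t<json>".
    val lines = ssc.socketTextStream("localhost", 9999)
    val events = lines.map { line =>
      val Array(id, json) = line.split("\t", 2)
      (id, json)
    }

    // Two independent output operations on the same DStream. If a worker dies
    // after the first one ran for a batch, the lost partitions are recomputed
    // and BOTH operations run again for them, hence at-least-once delivery.
    events.foreachRDD { rdd =>
      rdd.foreachPartition(_.foreach { case (id, json) => saveToElasticsearch(id, json) })
    }
    events.foreachRDD { rdd =>
      rdd.foreachPartition(_.foreach { case (id, json) => saveToFlume(id, json) })
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

The design point matches what Massimiliano describes: since both foreachRDD calls are re-executed when lost partitions are recomputed, the only general defense is to make each write idempotent (or deduplicate downstream); incrementing counters is the classic case where that is hard.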