Re: Spark - launching job for each action

Priya Ch Sun, 06 Sep 2015 07:31:32 -0700

Hi All,

 Thanks for the info. I have one more question:
When writing a streaming application, I specify a batch interval. Let's say
the interval is 1 sec: for every 1-sec batch, an RDD is formed and a job is
launched. If more than one action is specified on that RDD, how many jobs
would it launch?


I mean, every 1-sec batch launches a job; if there are two actions on it,
are 2 more jobs launched internally?

On Sun, Sep 6, 2015 at 1:15 PM, ayan guha <[email protected]> wrote:

> Hi
>
> "... Here in job2, when calculating rdd.first..."
>
> If you mean rdd2.first, then it uses rdd2 already computed by
> rdd2.count, because it is already available. If some partitions are not
> available due to GC, then only those partitions are recomputed.
>
> On Sun, Sep 6, 2015 at 5:11 PM, Jeff Zhang <[email protected]> wrote:
>
>> If you want to reuse the data, you need to call rdd2.cache
>>
>>
>>
>> On Sun, Sep 6, 2015 at 2:33 PM, Priya Ch <[email protected]>
>> wrote:
>>
>>> Hi All,
>>>
>>>  In Spark, each action results in launching a job. Let's say my Spark app
>>> looks like this:
>>>
>>> val baseRDD = sc.parallelize(Array(1, 2, 3, 4, 5), 2)
>>> val rdd1 = baseRDD.map(x => x + 2)
>>> val rdd2 = rdd1.filter(x => x % 2 == 0)
>>> val count = rdd2.count
>>> val firstElement = rdd2.first
>>>
>>> println("Count is " + count)
>>> println("First is " + firstElement)
>>>
>>> Now, rdd2.count launches job0 with 1 task and rdd2.first launches job1
>>> with 1 task. Here in job2, when calculating rdd.first, is the entire
>>> lineage computed again, or, since job0 already computed rdd2, is it
>>> reused?
>>>
>>> Thanks,
>>> Padma Ch
>>>
>>>
>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>
>
> --
> Best Regards,
> Ayan Guha
>
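
For anyone who wants to check the reuse question from the quoted thread, here is a rough sketch rather than a definitive answer. It assumes Spark 2.x or later (for sc.longAccumulator) and a local master; CacheDemo and mapCalls are names made up for this example. It counts how often the map function runs across rdd2.count and rdd2.first, once without and once with rdd2.cache. Accumulator updates made inside transformations can be applied more than once if a task is re-run, so treat the numbers as indicative only.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper object for this sketch; not part of Spark.
object CacheDemo {

  // Builds the same lineage as in the original question, runs two actions,
  // and returns how many times the map function was invoked in total.
  def mapCalls(sc: SparkContext, useCache: Boolean): Long = {
    val calls = sc.longAccumulator("mapCalls")
    val baseRDD = sc.parallelize(Array(1, 2, 3, 4, 5), 2)
    val rdd1 = baseRDD.map { x => calls.add(1L); x + 2 }
    val rdd2 = rdd1.filter(x => x % 2 == 0)
    if (useCache) rdd2.cache() // only marks rdd2 for caching; nothing runs yet

    rdd2.count() // first action: computes every partition of rdd2
    rdd2.first() // second action: a separate job on the same rdd2
    calls.value
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode with 2 threads, purely for experimentation.
    val conf = new SparkConf().setAppName("CacheDemo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      println("map calls without cache: " + mapCalls(sc, useCache = false))
      println("map calls with cache:    " + mapCalls(sc, useCache = true))
    } finally {
      sc.stop()
    }
  }
}
```

If the uncached run reports more map calls than there are input elements, the second action re-ran part of the lineage, and the cached run should not show that extra work. The job count per action is visible in the Spark UI while the application runs.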
