Hi All, Thanks for the info. I have one more doubt - When writing a streaming application, I specify batch-interval. Lets say if the interval is 1sec, for every 1sec batch, rdd is formed and launches a job. If there are >1 action specified on an rdd....how many jobs would it launch???
I mean every 1sec batch launches a job and suppose there are two actions then internally 2 more jobs launched ? On Sun, Sep 6, 2015 at 1:15 PM, ayan guha <[email protected]> wrote: > Hi > > "... Here in job2, when calculating rdd.first..." > > If you mean if rdd2.first, then it uses rdd2 already computed by > rdd2.count, because it is already available. If some partitions are not > available due to GC, then only those partitions are recomputed. > > On Sun, Sep 6, 2015 at 5:11 PM, Jeff Zhang <[email protected]> wrote: > >> If you want to reuse the data, you need to call rdd2.cache >> >> >> >> On Sun, Sep 6, 2015 at 2:33 PM, Priya Ch <[email protected]> >> wrote: >> >>> Hi All, >>> >>> In Spark, each action results in launching a job. Lets say my spark app >>> looks as- >>> >>> val baseRDD =sc.parallelize(Array(1,2,3,4,5),2) >>> val rdd1 = baseRdd.map(x => x+2) >>> val rdd2 = rdd1.filter(x => x%2 ==0) >>> val count = rdd2.count >>> val firstElement = rdd2.first >>> >>> println("Count is"+count) >>> println("First is"+firstElement) >>> >>> Now, rdd2.count launches job0 with 1 task and rdd2.first launches job1 >>> with 1 task. Here in job2, when calculating rdd.first, is the entire >>> lineage computed again or else as job0 already computes rdd2, is it reused >>> ??? >>> >>> Thanks, >>> Padma Ch >>> >>> >> >> >> >> -- >> Best Regards >> >> Jeff Zhang >> > > > > -- > Best Regards, > Ayan Guha >
