By default the lineage is recomputed for each action. If you want to reuse the data, you need to call rdd2.cache before the first action runs.
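
As a minimal sketch of that, here is your snippet with rdd2 cached. The object name CacheExample, the helper countAndFirst, and the local[2] master are assumptions made for a local test run, not part of your app:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CacheExample {
  // Hypothetical helper: runs both actions on a cached rdd2 and returns (count, first).
  def countAndFirst(sc: SparkContext): (Long, Int) = {
    val baseRDD = sc.parallelize(Array(1, 2, 3, 4, 5), 2)
    val rdd1 = baseRDD.map(x => x + 2)
    // cache() only marks rdd2 for in-memory persistence; nothing is computed yet.
    val rdd2 = rdd1.filter(x => x % 2 == 0).cache()

    // First action: computes the lineage and stores rdd2's partitions.
    val count = rdd2.count()
    // Second action: reads the cached partition instead of recomputing map and filter.
    val firstElement = rdd2.first()
    (count, firstElement)
  }

  def main(args: Array[String]): Unit = {
    // App name and master are assumptions for running locally.
    val conf = new SparkConf().setAppName("CacheExample").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (count, firstElement) = countAndFirst(sc)
      println("Count is " + count)
      println("First is " + firstElement)
    } finally {
      sc.stop()
    }
  }
}
```

Without the cache() call, the second action would run the map and filter again on whichever partitions it needs to read.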

On Sun, Sep 6, 2015 at 2:33 PM, Priya Ch <[email protected]> wrote:

> Hi All,
>
> In Spark, each action results in launching a job. Let's say my Spark app
> looks like this:
>
> val baseRDD = sc.parallelize(Array(1,2,3,4,5), 2)
> val rdd1 = baseRDD.map(x => x + 2)
> val rdd2 = rdd1.filter(x => x % 2 == 0)
> val count = rdd2.count
> val firstElement = rdd2.first
>
> println("Count is " + count)
> println("First is " + firstElement)
>
> Now, rdd2.count launches job0 with 1 task and rdd2.first launches job1
> with 1 task. Here in job1, when calculating rdd2.first, is the entire
> lineage computed again, or, since job0 already computes rdd2, is it
> reused?
>
> Thanks,
> Padma Ch
>

--
Best Regards

Jeff Zhang
