It's hard to say without more context on where your job is stalling, what file sizes you're working with, etc.
The best answer is to test and see, but in general, for simple DAGs, I find that not persisting anything typically runs the fastest. If I were to persist anything, it would be rdd6, because it took some processing to create and I might want to reuse rdd6 for more analyses in the future.

Jon

On Wed, Feb 8, 2017 at 1:40 AM, Jörn Franke <jornfra...@gmail.com> wrote:

> Depends on the use case, but a persist before checkpointing can make sense
> after some of the map steps.
>
> On 8 Feb 2017, at 03:09, Shushant Arora <shushantaror...@gmail.com> wrote:
>
> Hi
>
> I have a workflow like below:
>
> rdd1 = sc.textFile(input);
> rdd2 = rdd1.filter(filterfunc1);
> rdd3 = rdd1.filter(fiterfunc2);
> rdd4 = rdd2.map(mapptrans1);
> rdd5 = rdd3.map(maptrans2);
> rdd6 = rdd4.union(rdd5);
> rdd6.foreach(some transformation);
>
> <image.png>
>
> 1. Do I need to persist rdd1? Or is it not required, since there is only
>    one action at rdd6, which will create only one job, and within a single
>    job there is no need to persist?
> 2. Also, what if the transformation on rdd2 is reduceByKey instead of map?
>    Will it again be the same, with no need to persist since it is a single job?
>
> Thanks
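
To make the trade-off in the question concrete, here is a rough toy model in plain Python. It is not Spark: LazyDataset, run_workflow, the sample input, and the two map functions are all made up for illustration. It only mimics the idea that a lazy dataset is recomputed from its lineage each time a downstream branch asks for it, unless it has been persisted. In this model rdd1 feeds two branches, so the input is read twice without persist and once with it. Whether that saving outweighs the cost of caching in a real job is something you would still have to measure.

```python
class LazyDataset:
    """Toy stand-in for an RDD: stores a recipe and recomputes it on every
    read unless persist() was called. Hypothetical, not a Spark API."""

    def __init__(self, compute):
        self._compute = compute   # zero-argument function returning an iterable
        self._persist = False
        self._cached = None

    def persist(self):
        # Mark this dataset so its first materialization is kept in memory.
        self._persist = True
        return self

    def collect(self):
        # Serve from the cache if a previous read stored the data.
        if self._cached is not None:
            return self._cached
        data = list(self._compute())
        if self._persist:
            self._cached = data
        return data

    def filter(self, pred):
        return LazyDataset(lambda: (x for x in self.collect() if pred(x)))

    def map(self, fn):
        return LazyDataset(lambda: (fn(x) for x in self.collect()))

    def union(self, other):
        return LazyDataset(lambda: self.collect() + other.collect())


def run_workflow(lines, persist_source):
    """Mirror the rdd1..rdd6 shape from the question and count source reads."""
    reads = {"count": 0}

    def read_input():
        # Stands in for sc.textFile(input); counts how often it is evaluated.
        reads["count"] += 1
        return iter(lines)

    rdd1 = LazyDataset(read_input)
    if persist_source:
        rdd1.persist()
    rdd2 = rdd1.filter(lambda s: s.startswith("a"))   # placeholder for filterfunc1
    rdd3 = rdd1.filter(lambda s: s.startswith("b"))   # placeholder for the second filter
    rdd4 = rdd2.map(str.upper)                        # placeholder map function
    rdd5 = rdd3.map(lambda s: s[::-1])                # placeholder map function
    rdd6 = rdd4.union(rdd5)
    return rdd6.collect(), reads["count"]


if __name__ == "__main__":
    sample = ["apple", "banana", "avocado", "cherry"]
    print(run_workflow(sample, persist_source=False))
    print(run_workflow(sample, persist_source=True))
```

The two runs return the same result; only the number of times the source is evaluated differs. In a real cluster the cached copy also costs memory or disk, which is one reason not persisting can still come out faster for a simple DAG.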