Hard to say without more context: where your job is stalling, what file
sizes you're working with, etc.

The best answer is to test and see, but in general, for simple DAGs, I
find that not persisting anything typically runs fastest. If I persisted
anything, it would be rdd6, because it took some processing to create and
I might want to reuse it for further analyses.
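
To make the trade-off concrete, here is a toy model in plain Python. It is
not Spark code: each "RDD" is just a function that recomputes its data from
its parent when called, and every name in it (source, lazy_filter,
persisted, and so on) is made up for illustration. It only shows the general
idea that two branches hanging off rdd1 each walk back to the source unless
rdd1 is cached. In real Spark the corresponding call would be
rdd1.persist() or rdd1.cache(), and whether caching actually beats
re-reading the input depends on your data and cluster.

```python
# Toy model in plain Python (NOT Spark): each "RDD" is a zero-argument
# function that recomputes its data from its parent every time it is called.

reads = {"count": 0}  # how many times the source has been scanned


def source():
    """Stands in for sc.textFile(input); counts each full scan."""
    reads["count"] += 1
    return ["a1", "b2", "a3", "b4"]


def lazy_filter(parent, pred):
    return lambda: [x for x in parent() if pred(x)]


def lazy_map(parent, fn):
    return lambda: [fn(x) for x in parent()]


def lazy_union(left, right):
    return lambda: left() + right()


def persisted(parent):
    """Stands in for rdd.persist(): compute once, then reuse the result."""
    cache = []

    def run():
        if not cache:
            cache.append(parent())
        return cache[0]

    return run


def build(rdd1):
    """Same shape as the workflow in the question (hypothetical functions)."""
    rdd2 = lazy_filter(rdd1, lambda s: s.startswith("a"))
    rdd3 = lazy_filter(rdd1, lambda s: s.startswith("b"))
    rdd4 = lazy_map(rdd2, str.upper)
    rdd5 = lazy_map(rdd3, lambda s: s * 2)
    return lazy_union(rdd4, rdd5)


def scans_for(rdd1):
    """Run the single 'action' and report how often the source was scanned."""
    reads["count"] = 0
    result = build(rdd1)()
    return reads["count"], result


scans_plain, out_plain = scans_for(source)
scans_cached, out_cached = scans_for(persisted(source))
print(scans_plain, scans_cached)  # prints: 2 1
```

Both runs produce the same result. The only difference is how many times
the source is scanned, which is the cost you would weigh against the cost of
holding rdd1 in memory or on disk.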

Jon

On Wed, Feb 8, 2017 at 1:40 AM, Jörn Franke <jornfra...@gmail.com> wrote:

> Depends on the use case, but a persist before checkpointing can make sense
> after some of the map steps.
>
> On 8 Feb 2017, at 03:09, Shushant Arora <shushantaror...@gmail.com> wrote:
>
> Hi
>
> I have a workflow like below:
>
> rdd1 = sc.textFile(input);
> rdd2 = rdd1.filter(filterfunc1);
> rdd3 = rdd1.filter(filterfunc2);
> rdd4 = rdd2.map(maptrans1);
> rdd5 = rdd3.map(maptrans2);
> rdd6 = rdd4.union(rdd5);
> rdd6.foreach(some transformation);
>
>    1. Do I need to persist rdd1? Or is that not required, since there is
>    only one action (on rdd6), which creates only one job, and within a
>    single job there is no need to persist?
>    2. What if the transformation on rdd2 is reduceByKey instead of map?
>    Is the answer the same, i.e. no need to persist because it is still a
>    single job?
>
>
> Thanks
>
>
