There should be no difference, assuming you don't use the intermediate
rdd values you are creating (rdd1, rdd2) for anything else. The first
example still creates those intermediate RDD objects; you are just using
them implicitly rather than binding them to vals.

It's also worth pointing out that Spark is able to pipeline operations
together into stages. That is, it should effectively translate something
like map(f1).map(f2).map(f3) into map(f1 -> f2 -> f3), in pseudocode, if
you will. Here is a more detailed explanation from one of the committers
on SO:
http://stackoverflow.com/questions/19340808/spark-single-pipelined-scala-command-better-than-separate-commands
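To make the pipelining idea concrete, here is a rough sketch (reusing the
rdd from the sketch above; this is only an illustration of the idea, not
Spark's actual internals):

    val f1 = (x: Int) => x + 1
    val f2 = (x: Int) => x * 2
    val f3 = (x: Int) => x - 3

    // Three separate maps...
    val threeMaps = rdd.map(f1).map(f2).map(f3)

    // ...behave like a single map over the composed function,
    // applied element by element within one stage.
    val oneMap = rdd.map(f1 andThen f2 andThen f3)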

On Tue, Jun 23, 2015 at 5:17 PM, Ashish Soni <asoni.le...@gmail.com> wrote:

> Hi All ,
>
> What is the difference between the below in terms of execution on a cluster
> with 1 or more worker nodes?
>
> rdd.map(...).map(...)...map(..)
>
> vs
>
> val rdd1 = rdd.map(...)
> val rdd2 = rdd1.map(...)
> val rdd3 = rdd2.map(...)
>
> Thanks,
> Ashish
>
