There should be no difference, assuming you don't use the intermediate RDD values (rdd1, rdd2) for anything else. The first example still creates those intermediate RDD objects; you are just using them implicitly rather than binding them to names.
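To make that concrete, here is a minimal sketch (assuming a SparkContext named sc and an RDD of Ints); both forms build the same lazy lineage and do the same work once an action runs:

// Minimal sketch: chained maps vs. named intermediates build the same lineage.
// Nothing executes until an action like collect() is called.
val rdd = sc.parallelize(1 to 10)

// Form 1: chained
val chained = rdd.map(_ + 1).map(_ * 2).map(_ - 3)

// Form 2: named intermediates -- same transformations, same lazy lineage
val rdd1 = rdd.map(_ + 1)
val rdd2 = rdd1.map(_ * 2)
val rdd3 = rdd2.map(_ - 3)

// Both trigger the same work when an action is called
assert(chained.collect().sameElements(rdd3.collect()))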
It's also worth pointing out that Spark is able to pipeline operations together into stages. That is, it should effectively translate something like map(f1).map(f2).map(f3) into map(f1 -> f2 -> f3), in pseudocode, if you will. Here is a more detailed explanation from one of the committers on SO: http://stackoverflow.com/questions/19340808/spark-single-pipelined-scala-command-better-than-separate-commands

A small sketch of that fusion idea follows the quoted message below.

On Tue, Jun 23, 2015 at 5:17 PM, Ashish Soni <asoni.le...@gmail.com> wrote:
> Hi All ,
>
> What is difference between below in terms of execution to the cluster with
> 1 or more worker node
>
> rdd.map(...).map(...)...map(..)
>
> vs
>
> val rdd1 = rdd.map(...)
> val rdd2 = rdd1.map(...)
> val rdd3 = rdd2.map(...)
>
> Thanks,
> Ashish
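The sketch mentioned above, illustrating the fusion idea (not Spark's internal code): three narrow map transformations in one stage behave like a single map over the composed function, so each element flows through f1 -> f2 -> f3 in turn with no extra passes over the data.

// Assumes a SparkContext named sc
val nums = sc.parallelize(1 to 10)

val f1 = (x: Int) => x + 1
val f2 = (x: Int) => x * 2
val f3 = (x: Int) => x - 3

// Three chained maps...
val threeMaps = nums.map(f1).map(f2).map(f3)
// ...behave like one map over the composed function
val oneMap = nums.map(f1 andThen f2 andThen f3)

// Same elements either way, and both run as a single stage
assert(threeMaps.collect().sameElements(oneMap.collect()))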