There should be no difference, assuming you don't use the intermediate RDD values (rdd1, rdd2) for anything else. The first example still creates those intermediate RDD objects; you are just using them implicitly rather than binding them to names.
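To make that concrete, here is a minimal sketch (assuming a SparkContext named sc and an RDD of Ints); both forms build the same lazy lineage and do the same work once an action runs:

// Minimal sketch: chained maps vs. named intermediates build the same lineage.
// Nothing executes until an action like collect() is called.
val rdd = sc.parallelize(1 to 10)

// Form 1: chained
val chained = rdd.map(_ + 1).map(_ * 2).map(_ - 3)

// Form 2: named intermediates -- same transformations, same lazy lineage
val rdd1 = rdd.map(_ + 1)
val rdd2 = rdd1.map(_ * 2)
val rdd3 = rdd2.map(_ - 3)

// Both trigger the same work when an action is called
assert(chained.collect().sameElements(rdd3.collect()))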
It's also worth pointing out that Spark is able to pipeline operations together into stages. That is, it should effectively translate something like map(f1).map(f2).map(f3) into map(f1 -> f2 -> f3), in pseudocode, if you will. Here is a more detailed explanation from one of the committers on SO: http://stackoverflow.com/questions/19340808/spark-single-pipelined-scala-command-better-than-separate-commands

A small sketch of that fusion idea follows the quoted message below.

On Tue, Jun 23, 2015 at 5:17 PM, Ashish Soni <asoni.le...@gmail.com> wrote:
> Hi All ,
>
> What is difference between below in terms of execution to the cluster with
> 1 or more worker node
>
> rdd.map(...).map(...)...map(..)
>
> vs
>
> val rdd1 = rdd.map(...)
> val rdd2 = rdd1.map(...)
> val rdd3 = rdd2.map(...)
>
> Thanks,
> Ashish
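The sketch mentioned above, illustrating the fusion idea (not Spark's internal code): three narrow map transformations in one stage behave like a single map over the composed function, so each element flows through f1 -> f2 -> f3 in turn with no extra passes over the data.

// Assumes a SparkContext named sc
val nums = sc.parallelize(1 to 10)

val f1 = (x: Int) => x + 1
val f2 = (x: Int) => x * 2
val f3 = (x: Int) => x - 3

// Three chained maps...
val threeMaps = nums.map(f1).map(f2).map(f3)
// ...behave like one map over the composed function
val oneMap = nums.map(f1 andThen f2 andThen f3)

// Same elements either way, and both run as a single stage
assert(threeMaps.collect().sameElements(oneMap.collect()))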