Re: Declaring multiple RDDs and efficiency concerns

Sean Owen Fri, 14 Nov 2014 09:46:08 -0800

This code executes on the driver, and an "RDD" here is really just a
handle on the distributed data out there. It's a local bookkeeping
object. So manipulating these objects in the local driver code has
virtually no performance impact. The two versions (one val per
intermediate step vs. a single chained expression) would be about
identical*.


* maybe someone can point out a case where not maintaining the
reference lets something get cleaned up earlier, but I'm not aware of
this sort of effect
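
For what it's worth, here is a minimal sketch of both styles side by
side. It assumes Spark is on the classpath and runs with a local master.
The bodies of function1..function3 (a map, a filter, a map) and the names
ChainingSketch, withVals and pipeline are made up for illustration; they
are not from the original code. Declaring the steps as function values of
type RDD[Int] => RDD[Int] is one way to get the chained form, via the
standard andThen on Scala functions. Transformations like map and filter
are lazy, so neither style does distributed work until an action such as
collect() is called.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ChainingSketch {
  // Hypothetical stand-ins for function1..function3 from the question,
  // written as function values so they compose with andThen.
  val function1: RDD[Int] => RDD[Int] = _.map(_ + 1)
  val function2: RDD[Int] => RDD[Int] = _.filter(_ % 2 == 0)
  val function3: RDD[Int] => RDD[Int] = _.map(_ * 10)

  // Style 1: one val per intermediate step.
  def withVals(myRdd: RDD[Int]): RDD[Int] = {
    val rdd1 = function1(myRdd)
    val rdd2 = function2(rdd1)
    val rdd3 = function3(rdd2)
    rdd3
  }

  // Style 2: compose the steps once, then apply the result to an RDD.
  val pipeline: RDD[Int] => RDD[Int] =
    function1 andThen function2 andThen function3

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("chaining-sketch").setMaster("local[1]")
    val sc = new SparkContext(conf)
    try {
      val myRdd = sc.parallelize(1 to 10)

      // Both calls only build lineage on the driver; no job runs here.
      val a = withVals(myRdd)
      val b = pipeline(myRdd)

      // collect() is an action, so this is where the work actually happens.
      assert(a.collect().sameElements(b.collect()))

      // The two lineages should have the same shape (RDD ids will differ).
      println(a.toDebugString)
      println(b.toDebugString)
    } finally {
      sc.stop()
    }
  }
}
```

So the choice between the two is a stylistic one: in both cases the
driver only holds a few small handle objects.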

On Fri, Nov 14, 2014 at 4:31 PM, Simone Franzini <captainfr...@gmail.com> wrote:
> Let's say I have to apply a complex sequence of operations to a certain RDD.
> In order to make code more modular/readable, I would typically have
> something like this:
>
> object myObject {
>   def main(args: Array[String]) {
>     val rdd1 = function1(myRdd)
>     val rdd2 = function2(rdd1)
>     val rdd3 = function3(rdd2)
>   }
>
>   def function1(rdd: RDD) : RDD = { doSomething }
>   def function2(rdd: RDD) : RDD = { doSomethingElse }
>   def function3(rdd: RDD) : RDD = { doSomethingElseYet }
> }
>
> So I am explicitly declaring vals for the intermediate steps. Does this end
> up using more storage than if I just chained all of the operations and
> declared only one val instead?
> If yes, is there a better way to chain together the operations?
> Ideally I would like to do something like:
>
> val rdd = function1.function2.function3
>
> Is there a way I can write the signature of my functions to accomplish this?
> Is this also an efficiency issue or just a stylistic one?
>
> Simone Franzini, PhD
>
> http://www.linkedin.com/in/simonefranzini

