This code executes on the driver, and an "RDD" here is really just a handle on all the distributed data out there. It's a local bookkeeping object. So, manipulating these objects in the local driver code has virtually no performance impact. The version with intermediate vals and the chained version would be about identical*.
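To make the comparison concrete, here is a minimal sketch of both styles, assuming spark-core is on the classpath and a local master. The bodies of function1, function2 and function3 are hypothetical stand-ins (the original message does not say what they do), and RDD[Int] is chosen only for illustration. The chained form uses plain Scala function composition (andThen), not anything Spark-specific.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ChainingSketch {
  // Hypothetical stand-ins for the three steps in the question.
  // Each one only builds a new RDD handle; map and filter are lazy.
  def function1(rdd: RDD[Int]): RDD[Int] = rdd.map(_ + 1)
  def function2(rdd: RDD[Int]): RDD[Int] = rdd.filter(_ % 2 == 0)
  def function3(rdd: RDD[Int]): RDD[Int] = rdd.map(_ * 10)

  // Chained form: compose the three steps into a single function.
  val pipeline: RDD[Int] => RDD[Int] =
    (function1 _) andThen function2 andThen function3

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for this sketch.
    val conf = new SparkConf().setAppName("chaining-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val myRdd = sc.parallelize(1 to 10)

      // Version 1: explicit vals for the intermediate steps.
      val rdd1 = function1(myRdd)
      val rdd2 = function2(rdd1)
      val rdd3 = function3(rdd2)

      // Version 2: one val, built from the composed function.
      val chained = pipeline(myRdd)

      // Nothing runs on the cluster until an action such as collect().
      println(rdd3.collect().sameElements(chained.collect()))
    } finally {
      sc.stop()
    }
  }
}
```

Both versions describe the same chain of transformations; the only difference is how many local references the driver holds, so the choice is mostly a stylistic one.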
* Maybe someone can point out a case where not maintaining the reference lets something get cleaned up earlier, but I'm not aware of this sort of effect.

On Fri, Nov 14, 2014 at 4:31 PM, Simone Franzini <captainfr...@gmail.com> wrote:
> Let's say I have to apply a complex sequence of operations to a certain RDD.
> In order to make code more modular/readable, I would typically have
> something like this:
>
> object myObject {
>   def main(args: Array[String]) {
>     val rdd1 = function1(myRdd)
>     val rdd2 = function2(rdd1)
>     val rdd3 = function3(rdd2)
>   }
>
>   def function1(rdd: RDD) : RDD = { doSomething }
>   def function2(rdd: RDD) : RDD = { doSomethingElse }
>   def function3(rdd: RDD) : RDD = { doSomethingElseYet }
> }
>
> So I am explicitly declaring vals for the intermediate steps. Does this end
> up using more storage than if I just chained all of the operations and
> declared only one val instead?
> If yes, is there a better way to chain together the operations?
> Ideally I would like to do something like:
>
> val rdd = function1.function2.function3
>
> Is there a way I can write the signature of my functions to accomplish this?
> Is this also an efficiency issue or just a stylistic one?
>
> Simone Franzini, PhD
>
> http://www.linkedin.com/in/simonefranzini