You can also take the more extreme approach of using SparkContext#runJob (or submitJob) to write a custom action that does everything you want in one pass. Usually that's not worth the extra effort, though.
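A lighter-weight way to get one-pass behavior, without writing a custom action, is to count as a side effect of the save using an accumulator. This is a sketch, not the poster's code: the input/output paths and the stand-in for expensiveCalculation are placeholders, and it uses the longAccumulator API (Spark 2.0+). Note the usual accumulator caveat: if a task is retried, elements can be counted more than once.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SinglePassCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("single-pass-count").setMaster("local[*]"))

    // Stand-in for the expensive per-line computation from the original post.
    val rdd = sc.textFile("someFile").map(line => line.reverse)

    val counter = sc.longAccumulator("rowCount")

    // Count each element as it flows through the single pass that writes the file.
    rdd.map { x => counter.add(1); x }.saveAsObjectFile("file2")

    // The accumulator value is only reliable after the action has completed.
    println(counter.value)

    sc.stop()
  }
}
```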
On Thu, Mar 26, 2015 at 9:27 AM, Sean Owen <so...@cloudera.com> wrote:

> To avoid computing twice you need to persist the RDD, but that need not be
> in memory. You can persist to disk with persist().
>
> On Mar 26, 2015 4:11 PM, "Wang, Ningjun (LNG-NPV)" <
> ningjun.w...@lexisnexis.com> wrote:
>
>> I have an RDD that is expensive to compute. I want to save it as an object
>> file and also print the count. How can I avoid computing the RDD twice?
>>
>> val rdd = sc.textFile(someFile).map(line => expensiveCalculation(line))
>>
>> val count = rdd.count() // this forces computation of the rdd
>> println(count)
>> rdd.saveAsObjectFile(file2) // this computes the RDD again
>>
>> I can avoid the double computation by using cache:
>>
>> val rdd = sc.textFile(someFile).map(line => expensiveCalculation(line))
>> rdd.cache()
>> val count = rdd.count()
>> println(count)
>> rdd.saveAsObjectFile(file2) // the cached rdd is reused here, not recomputed
>>
>> This computes the rdd only once. However, the rdd has millions of items,
>> and caching it in memory will cause out-of-memory errors.
>>
>> Question: how can I avoid the double computation without using cache?
>>
>> Ningjun
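Sean's suggestion of persisting to disk rather than memory looks like the following. This is a minimal sketch under the same assumptions as the original post ("someFile" and "file2" are placeholder paths, and a cheap stand-in replaces expensiveCalculation); the key difference from rdd.cache() is passing StorageLevel.DISK_ONLY, so the materialized RDD is spilled to local disk instead of held in executor memory.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object DiskPersistExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("disk-persist").setMaster("local[*]"))

    // Stand-in for the expensive per-line computation from the original post.
    val rdd = sc.textFile("someFile").map(line => line.reverse)

    // DISK_ONLY avoids the out-of-memory risk of the default MEMORY_ONLY cache.
    rdd.persist(StorageLevel.DISK_ONLY)

    val count = rdd.count()       // first action: computes the RDD, writes blocks to disk
    println(count)
    rdd.saveAsObjectFile("file2") // second action: reads the persisted blocks, no recompute

    rdd.unpersist()
    sc.stop()
  }
}
```

Reading the blocks back from local disk is still far cheaper than rerunning an expensive computation, which is why this usually beats recomputing.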