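To avoid computing the RDD twice you need to persist it, but it need not be held in memory. You can persist to disk by passing a storage level, e.g. rdd.persist(StorageLevel.DISK_ONLY). Note that persist() with no arguments is the same as cache() and stores the RDD in memory only.

On Mar 26, 2015 4:11 PM, "Wang, Ningjun (LNG-NPV)" <ningjun.w...@lexisnexis.com> wrote: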
> I have an RDD that is expensive to compute. I want to save it as an object
> file and also print the count. How can I avoid double computation of the
> RDD?
>
> val rdd = sc.textFile(someFile).map(line => expensiveCalculation(line))
>
> val count = rdd.count() // this forces computation of the RDD
> println(count)
> rdd.saveAsObjectFile(file2) // this computes the RDD again
>
> I can avoid double computation by using cache:
>
> val rdd = sc.textFile(someFile).map(line => expensiveCalculation(line))
> rdd.cache()
> val count = rdd.count()
> println(count)
> rdd.saveAsObjectFile(file2) // this reuses the cached RDD
>
> This computes the RDD only once. However, the RDD has millions of items and
> caching it will cause an out-of-memory error.
>
> Question: how can I avoid double computation without using cache?
>
> Ningjun
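
For reference, here is a minimal sketch of the disk-persistence approach suggested in the reply, written as a standalone application. It assumes the two paths arrive as command-line arguments; the object name PersistToDiskSketch and the body of expensiveCalculation are hypothetical stand-ins, not code from the original poster. The key change from the question is passing StorageLevel.DISK_ONLY to persist(), so the computed partitions are written to local disk rather than held in memory.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistToDiskSketch {
  // Hypothetical stand-in for the expensive per-line computation in the question.
  def expensiveCalculation(line: String): String = line.toUpperCase

  def main(args: Array[String]): Unit = {
    // Assumption: input path and output path are passed as the two arguments.
    require(args.length == 2, "usage: PersistToDiskSketch <someFile> <file2>")
    val someFile = args(0)
    val file2 = args(1)

    val conf = new SparkConf()
      .setAppName("persist-to-disk-sketch")
      .setIfMissing("spark.master", "local[*]") // local mode unless a master is already set
    val sc = new SparkContext(conf)

    try {
      val rdd = sc.textFile(someFile).map(line => expensiveCalculation(line))

      // Keep computed partitions on disk only, so memory is not exhausted.
      rdd.persist(StorageLevel.DISK_ONLY)

      val count = rdd.count()     // first action: computes the RDD and writes it to disk
      println(count)
      rdd.saveAsObjectFile(file2) // second action: reads the persisted partitions

      rdd.unpersist()             // release the disk blocks once both actions are done
    } finally {
      sc.stop()
    }
  }
}
```

If some of the data does fit in memory, StorageLevel.MEMORY_AND_DISK is another option: it keeps what fits in memory and spills the rest to disk. In either case, if a persisted block is lost (for example, when an executor fails), Spark recomputes that partition from the lineage.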