Hi Arun,

Intermediate results like keyedRecordPieces will not be materialized. This means that if you run

  partitoned = keyedRecordPieces.partitionBy(KeyPartitioner)
  partitoned.mapPartitions(doComputation).save()

again, keyedRecordPieces will be re-computed. In that case, caching or persisting keyedRecordPieces is a good idea to eliminate the unnecessary, expensive computation. What you can probably do is

  keyedRecordPieces = records.flatMap( record => Seq(key, recordPieces)).cache()

which will cache the RDD referenced by keyedRecordPieces in memory. For more options on cache and persist, take a look at
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD. There are two APIs you can use to persist RDDs; one of them lets you specify the storage level.

Thanks,
Liquan

On Tue, Sep 23, 2014 at 2:08 PM, Arun Ahuja <aahuj...@gmail.com> wrote:
> I have a general question on when persisting will be beneficial and when
> it won't:
>
> I have a task that runs as follows:
>
>   keyedRecordPieces = records.flatMap( record => Seq(key, recordPieces))
>   partitoned = keyedRecordPieces.partitionBy(KeyPartitioner)
>
>   partitoned.mapPartitions(doComputation).save()
>
> Is there value in having a persist somewhere here? For example, if the
> flatMap step is particularly expensive, will it ever be computed twice
> when there are no failures?
>
> Thanks
>
> Arun

--
Liquan Pei
Department of Physics
University of Massachusetts Amherst
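
P.S. For completeness, here is a minimal end-to-end sketch of how persist with an explicit storage level would fit into your pipeline. The input/output paths, the splitRecord helper, and the HashPartitioner are stand-ins I made up so the example compiles on its own; substitute your real record parsing and KeyPartitioner.

  import org.apache.spark.{SparkConf, SparkContext, HashPartitioner}
  import org.apache.spark.SparkContext._   // pair-RDD implicits on older Spark versions
  import org.apache.spark.storage.StorageLevel

  object PersistExample {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("persist-example"))

      // Placeholder input path and record-splitting logic.
      val records = sc.textFile("hdfs:///path/to/records")
      def splitRecord(record: String): Seq[(String, String)] = {
        val fields = record.split(",").toSeq
        // Use the first field as the key; every remaining field is a "piece".
        fields.tail.map(piece => (fields.head, piece))
      }

      // The expensive flatMap; it must emit (key, value) pairs for partitionBy.
      val keyedRecordPieces = records.flatMap(splitRecord)

      // cache() is equivalent to persist(StorageLevel.MEMORY_ONLY);
      // MEMORY_AND_DISK spills evicted partitions to disk instead of
      // recomputing them from scratch later.
      keyedRecordPieces.persist(StorageLevel.MEMORY_AND_DISK)

      // Stand-in for your KeyPartitioner.
      val partitoned = keyedRecordPieces.partitionBy(new HashPartitioner(100))

      // Every action on partitoned now reuses the persisted keyedRecordPieces
      // instead of re-running the flatMap.
      partitoned
        .mapPartitions(iter => iter.map { case (k, piece) => s"$k\t$piece" })
        .saveAsTextFile("hdfs:///path/to/output")

      sc.stop()
    }
  }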
partitoned = keyedRecordPieces.partitionBy(KeyPartitioner) partitoned.mapPartitions(doComputation).save() again, the keyedRecordPieces will be re-computed . In this case, cache or persist keyedRecordPieces is a good idea to eliminate unnecessary expensive computation. What you can probably do is keyedRecordPieces = records.flatMap( record => Seq(key, recordPieces)).cache() Which will cache the RDD referenced by keyedRecordPieces in memory. For more options on cache and persist, take a look at http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD. There are two APIs you can use to persist RDDs and one allows you to specify storage level. Thanks, Liquan On Tue, Sep 23, 2014 at 2:08 PM, Arun Ahuja <aahuj...@gmail.com> wrote: > I have a general question on when persisting will be beneficial and when > it won't: > > I have a task that runs as follow > > keyedRecordPieces = records.flatMap( record => Seq(key, recordPieces)) > partitoned = keyedRecordPieces.partitionBy(KeyPartitioner) > > partitoned.mapPartitions(doComputation).save() > > Is there value in having a persist somewhere here? For example if the > flatMap step is particularly expensive, will it ever be computed twice when > there are no failures? > > Thanks > > Arun > -- Liquan Pei Department of Physics University of Massachusetts Amherst