Hi Arun,

Intermediate results like keyedRecordPieces are not materialized by default.
This means that if you run

partitioned = keyedRecordPieces.partitionBy(KeyPartitioner)

partitioned.mapPartitions(doComputation).save()

again, keyedRecordPieces will be re-computed. In that case, caching or
persisting keyedRecordPieces is a good idea, since it eliminates the
unnecessary, expensive computation. What you can probably do is

keyedRecordPieces = records.flatMap( record => Seq((key, recordPieces))).cache()

This caches the RDD referenced by keyedRecordPieces in memory. For more
options on cache and persist, take a look at
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD.
There are two APIs you can use to persist RDDs: cache(), and persist(), which
additionally lets you specify the storage level.
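
Here is a minimal, self-contained sketch of the idea. The sample data, the
record-splitting logic, and the KeyPartitioner implementation below are made
up for illustration, so treat it as a rough outline rather than your actual
job:

// Minimal sketch, assuming a toy dataset and a hypothetical KeyPartitioner.
import org.apache.spark.{Partitioner, SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair-RDD operations on Spark 1.x
import org.apache.spark.storage.StorageLevel

object CacheSketch {

  // Hypothetical stand-in for the KeyPartitioner mentioned in this thread.
  class KeyPartitioner(parts: Int) extends Partitioner {
    override def numPartitions: Int = parts
    override def getPartition(key: Any): Int =
      ((key.hashCode % parts) + parts) % parts   // keep the result non-negative
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("cache-sketch").setMaster("local[2]"))

    // Stand-in for `records`; the real job would load these from storage.
    val records = sc.parallelize(Seq("a:1", "b:2", "a:3"))

    // The "expensive" flatMap, producing (key, piece) pairs.
    // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY);
    // persist(StorageLevel.MEMORY_AND_DISK) spills to disk if memory is tight.
    val keyedRecordPieces = records.flatMap { record =>
      val Array(key, piece) = record.split(":")
      Seq((key, piece))
    }.persist(StorageLevel.MEMORY_AND_DISK)

    val partitioned = keyedRecordPieces.partitionBy(new KeyPartitioner(2))

    // First action: runs the flatMap and fills the cache.
    partitioned.mapPartitions(_.map { case (k, v) => s"$k=$v" }).count()

    // Second action over the same lineage: re-uses the persisted
    // keyedRecordPieces instead of re-running the flatMap.
    partitioned.mapPartitions(_.map { case (k, v) => s"$k,$v" }).count()

    sc.stop()
  }
}

The first count() triggers the flatMap and fills the cache; the second action
reads keyedRecordPieces back from the cache instead of recomputing it.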

Thanks,
Liquan



On Tue, Sep 23, 2014 at 2:08 PM, Arun Ahuja <aahuj...@gmail.com> wrote:

> I have a general question on when persisting will be beneficial and when
> it won't:
>
> I have a task that runs as follows
>
> keyedRecordPieces  = records.flatMap( record => Seq(key, recordPieces))
> partitioned = keyedRecordPieces.partitionBy(KeyPartitioner)
>
> partitioned.mapPartitions(doComputation).save()
>
> Is there value in having a persist somewhere here?  For example if the
> flatMap step is particularly expensive, will it ever be computed twice when
> there are no failures?
>
> Thanks
>
> Arun
>



-- 
Liquan Pei
Department of Physics
University of Massachusetts Amherst
