Hi Joe,

First, work out which RDD is used most frequently. In your case, rdd2 and
rdd3 are filtered subsets of rdd1, so they are usually smaller than rdd1.
If rdd1 is not referenced elsewhere, it is more reasonable to cache rdd2
and/or rdd3 rather than rdd1.

Say rdd1 takes 10G and rdd2 takes 1G after filtering. If you cache both of
them, you end up with 11G of memory consumption, which might not be what
you want.
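
A minimal sketch of what I mean, assuming rdd1 is only an intermediate step.
The function name build_filtered_rdds and the parameters sc, path, pred1 and
pred2 are made up for illustration; pred1 and pred2 stand in for your
userDefinedFunc1 and userDefinedFunc2, and sc is an existing SparkContext.

```python
def build_filtered_rdds(sc, path, pred1, pred2):
    """Cache only the (smaller) filtered RDDs, not the large source RDD.

    All names here are hypothetical; `sc` is assumed to be a SparkContext.
    """
    # rdd1 is deliberately NOT cached: it is only an intermediate step.
    rdd1 = sc.textFile(path)

    # cache() only marks an RDD for caching; the data is stored the first
    # time an action (count(), collect(), ...) actually computes it.
    rdd2 = rdd1.filter(pred1).cache()
    rdd3 = rdd1.filter(pred2).cache()
    return rdd2, rdd3
```

The trade-off is that, with rdd1 uncached, the first action on rdd2 and the
first action on rdd3 each read the input file again. If that double read
costs more than the extra memory, caching rdd1 as well may still be the
better choice.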

Regards
Cheng


On Mon, Apr 14, 2014 at 8:32 PM, Joe L <selme...@yahoo.com> wrote:

> Hi, I am trying to cache 2 GB of data and to implement the following
> procedure. Is it necessary to cache rdd2, since rdd1 is already cached?
> In order to cache them, I did the following:
>
> rdd1 = textFile("hdfs...").cache()
>
> rdd2 = rdd1.filter(userDefinedFunc1).cache()
> rdd3 = rdd1.filter(userDefinedFunc2).cache()
>
