Hi Joe,

You need to figure out which RDDs are reused most frequently. In your case, rdd2 and rdd3 are filtered results of rdd1, so they are usually much smaller than rdd1, and it would be more reasonable to cache rdd2 and/or rdd3, provided rdd1 is not referenced elsewhere.
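For instance, a rough sketch of what I mean (assuming a SparkContext sc and the two user-defined filter functions from your snippet):

    val rdd1 = sc.textFile("hdfs://...")              // not cached: scanned only once
    val rdd2 = rdd1.filter(userDefinedFunc1).cache()  // reused later, keep in memory
    val rdd3 = rdd1.filter(userDefinedFunc2).cache()

    // If rdd1 had been cached earlier and is no longer needed, free it:
    // rdd1.unpersist()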
Say rdd1 takes 10G and rdd2 takes 1G after filtering. If you cache both of them, you end up with 11G of memory consumption, which might not be what you want.

Regards,
Cheng

On Mon, Apr 14, 2014 at 8:32 PM, Joe L <selme...@yahoo.com> wrote:
> Hi, I am trying to cache 2GB of data and to implement the following
> procedure. In order to cache them I did as follows. Is it necessary to
> cache rdd2 since rdd1 is already cached?
>
> rdd1 = textFile("hdfs...").cache()
>
> rdd2 = rdd1.filter(userDefinedFunc1).cache()
> rdd3 = rdd1.filter(userDefinedFunc2).cache()