Hi Gaurav, thanks for the pointer. The observation in that link is (at least qualitatively) similar to mine.
Now the question is: if I have big data (40 GB raw, ~60 GB when cached) and even big memory (192 GB), is it true that I cannot benefit from the RDD cache, and should instead persist on disk and leverage the filesystem cache? I will also try running more workers, so that each JVM has a smaller heap.

Best regards,
Wei

---------------------------------
Wei Tan, PhD
Research Staff Member
IBM T. J. Watson Research Center
http://researcher.ibm.com/person/us-wtan

From: Gaurav Jain <ja...@student.ethz.ch>
To: u...@spark.incubator.apache.org
Date: 06/18/2014 06:30 AM
Subject: Re: rdd.cache() is not faster?

You cannot assume that caching always reduces execution time, especially when the data set is large. If too much memory is used for caching, less memory is left for the actual computation itself; there has to be a balance between the two. Page 33 of this thesis from KTH discusses this trade-off:

http://www.diva-portal.org/smash/get/diva2:605106/FULLTEXT01.pdf

Best
-----
Gaurav Jain
Master's Student, D-INFK
ETH Zurich
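As a concrete illustration of the disk-persistence approach discussed above, here is a minimal Scala sketch. The input path, app name, and executor-memory setting are hypothetical placeholders, not values from this thread; it only shows the mechanics of persisting an RDD with DISK_ONLY so that cached blocks stay out of the JVM heap:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object DiskPersistSketch {
  def main(args: Array[String]): Unit = {
    // A smaller heap per JVM leaves more of the machine's RAM
    // for the OS filesystem cache (hypothetical sizing).
    val conf = new SparkConf()
      .setAppName("disk-persist-sketch")
      .set("spark.executor.memory", "16g")
    val sc = new SparkContext(conf)

    // "/data/input" is a placeholder path.
    val records = sc.textFile("/data/input")

    // DISK_ONLY keeps the persisted blocks entirely off the JVM heap;
    // repeated reads are then served largely by the OS page cache.
    val parsed = records.map(_.split(",")).persist(StorageLevel.DISK_ONLY)

    println(parsed.count()) // first action materializes the on-disk copy
    println(parsed.count()) // later actions reread the persisted blocks

    sc.stop()
  }
}
```

With DISK_ONLY the cached data never competes with the computation for heap space, which is exactly the balance Gaurav's reply describes; StorageLevel.MEMORY_AND_DISK_SER is a possible middle ground, serializing cached partitions to shrink their heap footprint while still keeping hot blocks in memory.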