Hi Gaurav, thanks for the pointer. The observation in that link is (at least 
qualitatively) similar to mine.

Now the question is: if I have big data (40 GB raw, 60 GB when cached) and 
even bigger memory (192 GB), can I really not benefit from the RDD cache? 
Should I instead persist on disk and leverage the filesystem cache?
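
To make the comparison concrete, here is roughly what I plan to run (the 
input path is a placeholder; since an RDD's storage level cannot be changed 
once set, I will run each variant in a separate job):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val sc = new SparkContext(new SparkConf().setAppName("cache-vs-disk"))

    // Placeholder path; mine is ~40 GB of input on HDFS.
    val rdd = sc.textFile("hdfs:///path/to/input")
      .persist(StorageLevel.MEMORY_ONLY)        // variant 1: same as cache()
      // .persist(StorageLevel.MEMORY_ONLY_SER) // variant 2: serialized, smaller footprint
      // .persist(StorageLevel.DISK_ONLY)       // variant 3: disk + OS filesystem cache

    rdd.count() // first pass materializes the persisted copy
    rdd.count() // second pass should show the benefit (or lack of it)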

I will also try more workers per machine, so that each JVM has a smaller 
heap (and, I hope, shorter GC pauses).
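
For the record, in standalone mode I think this can be set per machine in 
conf/spark-env.sh; the numbers below are just an example for a 192 GB box, 
not a recommendation:

    # conf/spark-env.sh -- run several smaller worker JVMs per machine
    export SPARK_WORKER_INSTANCES=4   # four workers per box instead of one
    export SPARK_WORKER_MEMORY=40g    # upper bound each worker can give to executors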

Best regards,
Wei

---------------------------------
Wei Tan, PhD
Research Staff Member
IBM T. J. Watson Research Center
http://researcher.ibm.com/person/us-wtan



From:   Gaurav Jain <ja...@student.ethz.ch>
To:     u...@spark.incubator.apache.org, 
Date:   06/18/2014 06:30 AM
Subject:        Re: rdd.cache() is not faster?



You cannot assume that caching will always reduce execution time,
especially if the dataset is large. If too much memory is used for caching,
less memory is left for the actual computation itself; there has to be a
balance between the two.
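
If I remember correctly, the fraction of the executor heap reserved for
cached RDDs is tunable via spark.storage.memoryFraction (0.6 by default);
lowering it is one way to shift that balance. A minimal sketch, with an
illustrative value:

    import org.apache.spark.SparkConf

    // Reserve less of the heap for the RDD cache, leaving more for
    // computation; 0.4 is illustrative, 0.6 is the documented default.
    val conf = new SparkConf()
      .setAppName("tuned-cache")
      .set("spark.storage.memoryFraction", "0.4")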

Page 33 of this thesis from KTH talks about this:
http://www.diva-portal.org/smash/get/diva2:605106/FULLTEXT01.pdf

Best



-----
Gaurav Jain
Master's Student, D-INFK
ETH Zurich

