"if I do have big data (40GB, cached size is 60GB) and even big memory (192
GB), I cannot benefit from RDD cache, and should persist on disk and
leverage filesystem cache?"

Whether to persist (spill) data to disk is not always clear-cut, because
recomputing an RDD partition from its lineage is often cheaper than reading
the saved partition back from disk. That is why the default storage level
(MEMORY_ONLY) never writes partitions to disk: partitions that do not fit in
memory are simply recomputed on the fly. You can also try Kryo serialization
(if you are not using it already) to reduce memory usage, and experimenting
with different storage levels (MEMORY_ONLY_SER, for example) might help as
well.
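To make that concrete, here is a minimal sketch of setting Kryo and a
serialized storage level on an RDD. The input path and app name are
illustrative, and it assumes a Spark installation to run against:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

object CacheLevels {
  def main(args: Array[String]): Unit = {
    // Enable Kryo to shrink the serialized size of cached RDD data.
    val conf = new SparkConf()
      .setAppName("cache-levels") // illustrative name
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)

    val data = sc.textFile("hdfs:///path/to/input") // hypothetical path

    // MEMORY_ONLY_SER keeps partitions as serialized byte arrays:
    // more CPU per access, but a much smaller footprint than the
    // default MEMORY_ONLY (deserialized Java objects on the heap).
    data.persist(StorageLevel.MEMORY_ONLY_SER)

    // MEMORY_AND_DISK would instead spill non-fitting partitions to
    // disk rather than recompute them; whether that wins depends on
    // how expensive the lineage is to recompute.
    println(data.count())

    sc.stop()
  }
}
```

Note that `persist` only marks the RDD; the partitions are materialized the
first time an action (such as `count`) runs.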

Best
Gaurav Jain
Master's Student, D-INFK
ETH Zurich
Email: jaing at student dot ethz dot ch



