" if I do have big data (40GB, cached size is 60GB) and even big memory (192 GB), I cannot benefit from RDD cache, and should persist on disk and leverage filesystem cache?"
The answer to whether you should persist (spill) data to disk is not always clear-cut, because the functions that compute RDD partitions are generally cheaper to re-run than reading the saved partitions back from disk. That is why the default storage level (MEMORY_ONLY) never writes partitions to disk: partitions that do not fit in memory are simply recomputed on the fly. You can also try Kryo serialization (if you are not using it already) to reduce memory usage, and experimenting with other storage levels (MEMORY_ONLY_SER, for example) may help as well.

Best,
Gaurav Jain
Master's Student, D-INFK
ETH Zurich
Email: jaing at student dot ethz dot ch
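P.S. In case it helps, here is a minimal sketch of what trying these options looks like in the Scala API. The app name and input path are placeholders, not from your job:

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.storage.StorageLevel

  // Enable Kryo serialization to shrink the in-memory footprint.
  val conf = new SparkConf()
    .setAppName("CacheExperiment")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  val sc = new SparkContext(conf)

  val rdd = sc.textFile("hdfs:///path/to/40gb-input")  // placeholder path

  // Pick ONE storage level per RDD; it cannot be changed once assigned.
  // The default cache() is MEMORY_ONLY: partitions that don't fit are
  // recomputed on access, never written to disk.
  rdd.persist(StorageLevel.MEMORY_ONLY_SER)        // serialized, memory only
  // rdd.persist(StorageLevel.MEMORY_AND_DISK_SER) // serialized, spills to disk

  rdd.count()  // the first action materializes the cache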