Re: Difference between Checkpointing and Persist

2019-04-19 Thread Gene Pang
Hi Subash, I'm not sure how the checkpointing works, but with StorageLevel.MEMORY_AND_DISK, Spark will store the RDD in on-heap memory, and spill to disk if necessary. However, the data is only usable by that Spark job. Saving the RDD will write the data out to an external storage system, like HDF

Re: Difference between Checkpointing and Persist

2019-04-18 Thread Vadim Semenov
saving/checkpointing would be preferable in case of a big data set because:
- the RDD gets saved to HDFS and the DAG gets truncated, so if some partitions/executors fail it won't result in recomputing everything
- you don't use memory for caching, therefore the JVM heap is going to be smaller which
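
To make the "DAG gets truncated" point concrete, here is a toy sketch in plain Python. It is not Spark code and does not use any Spark API: `ToyDataset`, `lose_cache`, `lineage_depth` and `demo` are hypothetical names invented for illustration, and a local temp directory stands in for HDFS. It only models the behaviour described in this thread: a persisted dataset keeps its full lineage and has to be recomputed if the cached copy is lost, while a checkpointed dataset is written out and no longer depends on its parents.

```python
import json
import os
import tempfile


class ToyDataset:
    """Toy stand-in for an RDD (hypothetical, not a Spark class)."""

    def __init__(self, source=None, parent=None, fn=None):
        self.source = source          # raw rows, only set on a root dataset
        self.parent = parent          # upstream dataset in the lineage
        self.fn = fn                  # transformation applied to the parent's rows
        self.cached = None            # in-memory copy, set by persist()
        self.checkpoint_path = None   # file written by checkpoint()
        self.compute_calls = 0        # how often this node was actually recomputed

    def map(self, fn):
        # Lazily describe a new dataset; nothing is computed yet.
        return ToyDataset(parent=self, fn=lambda rows: [fn(r) for r in rows])

    def collect(self):
        if self.cached is not None:
            return list(self.cached)
        if self.checkpoint_path is not None:
            with open(self.checkpoint_path) as fh:
                return json.load(fh)
        self.compute_calls += 1
        if self.parent is None:
            return list(self.source)
        return self.fn(self.parent.collect())

    def persist(self):
        # Keep a copy in memory; the lineage is left untouched.
        self.cached = self.collect()
        return self

    def lose_cache(self):
        # Simulates losing the cached copy (e.g. an executor going away).
        self.cached = None

    def checkpoint(self, directory):
        # Write the rows to storage, then cut the link to the parents.
        rows = self.collect()
        path = os.path.join(directory, "checkpoint.json")
        with open(path, "w") as fh:
            json.dump(rows, fh)
        self.checkpoint_path = path
        self.parent = None
        self.fn = None
        self.source = None
        return self

    def lineage_depth(self):
        # Number of datasets in the chain, counting this one.
        depth = 1
        node = self.parent
        while node is not None:
            depth += 1
            node = node.parent
        return depth


def demo():
    persisted = ToyDataset(source=[1, 2, 3]).map(lambda x: x + 1).map(lambda x: x * 10).persist()
    persisted.lose_cache()
    rows_after_loss = persisted.collect()   # recomputed through the whole lineage

    with tempfile.TemporaryDirectory() as tmp:
        checkpointed = ToyDataset(source=[1, 2, 3]).map(lambda x: x + 1).map(lambda x: x * 10).checkpoint(tmp)
        rows_from_checkpoint = checkpointed.collect()   # read back from the file

    return {
        "persist_depth": persisted.lineage_depth(),
        "persist_recomputes": persisted.compute_calls,
        "persist_rows": rows_after_loss,
        "checkpoint_depth": checkpointed.lineage_depth(),
        "checkpoint_recomputes": checkpointed.compute_calls,
        "checkpoint_rows": rows_from_checkpoint,
    }


if __name__ == "__main__":
    print(demo())
```

In this toy model the persisted dataset still has a three-step lineage and is computed a second time once its cache is gone, whereas the checkpointed one has a lineage of one step and is only ever computed once. Real Spark behaviour has more moving parts (partitions, storage levels, executors), so treat this purely as an illustration of the idea.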

Re: Difference between Checkpointing and Persist

2019-04-18 Thread Jack Kolokasis
Hi, in my view a good approach is to first persist your data with StorageLevel.MEMORY_AND_DISK and then perform the join. This will speed up your computation because the data will be present in memory and on your local intermediate storage device. --Iacovos On 4/18/19 8:49 PM, Subash Pr