Hi Subash,
I'm not sure how the checkpointing works, but with
StorageLevel.MEMORY_AND_DISK, Spark stores the RDD in on-heap memory and
spills to disk if necessary. However, the data is only usable by that
Spark job. Saving the RDD writes the data out to an external storage
system, like HDFS.
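
For reference, a minimal sketch of the difference (assuming a SparkContext
named sc; the HDFS paths are placeholders):

  import org.apache.spark.storage.StorageLevel

  val rdd = sc.textFile("hdfs:///data/input")

  // Cache in on-heap memory, spilling partitions to local disk when memory
  // runs short. The cached blocks live only as long as this application.
  rdd.persist(StorageLevel.MEMORY_AND_DISK)

  // Saving writes the data out to external storage, so other jobs can read it.
  rdd.saveAsTextFile("hdfs:///data/output")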
Saving/checkpointing would be preferable for a big data set because:
- the RDD gets saved to HDFS and the DAG gets truncated, so if some
partitions/executors fail, it won't result in recomputing everything
- you don't use memory for caching, so the JVM heap stays smaller, which
reduces garbage-collection pressure (see the sketch below)
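
A minimal checkpointing sketch (assuming a SparkContext named sc; the paths
are placeholders):

  sc.setCheckpointDir("hdfs:///checkpoints")

  val parsed = sc.textFile("hdfs:///data/input").map(_.split(","))

  // checkpoint() marks the RDD to be materialized to the checkpoint dir; the
  // first action then writes it out and truncates the lineage (DAG), so lost
  // partitions are re-read from HDFS instead of recomputed from the source.
  parsed.checkpoint()
  parsed.count()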
Hi,
In my view, a good approach is to first persist your data with
StorageLevel.MEMORY_AND_DISK and then perform the join. This will accelerate
your computation because the data will be present in memory and on your
local intermediate storage device.
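
Something like this (a sketch, assuming two pair RDDs keyed on the same
field; names and paths are placeholders):

  import org.apache.spark.storage.StorageLevel

  val left  = sc.textFile("hdfs:///data/left").map(l => (l.split(",")(0), l))
  val right = sc.textFile("hdfs:///data/right").map(l => (l.split(",")(0), l))

  // Persist both sides; the first action computes and caches them, and any
  // later stage or retry reuses the cached blocks from memory or local disk.
  left.persist(StorageLevel.MEMORY_AND_DISK)
  right.persist(StorageLevel.MEMORY_AND_DISK)

  val joined = left.join(right)
  joined.count()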
--Iacovos
On 4/18/19 8:49 PM, Subash Pr