How to checkpoint an RDD after a stage and before reaching an action?

2017-02-04 Thread leo9r
Hi, I have a single-action job (saveAsObjectFile at the end) that includes several stages. One of those stages is an expensive join, "rdd1.join(rdd2)". I would like to checkpoint rdd1 right before the join to improve the stability of the job. However, what I'm seeing is that the job gets executed all t

Re: Spark-SQL parameters like shuffle.partitions should be stored in the lineage

2016-11-15 Thread leo9r
That's great insight, Mark. I'm looking forward to giving it a try! According to the JIRA issue "Adaptive execution in Spark", it seems that some functionality was added in Spark 1.6.0 and the rest is still in progress. Are there any improvements to the Sp

Re: Spark-SQL parameters like shuffle.partitions should be stored in the lineage

2016-11-14 Thread leo9r
Hi Daniel, I completely agree with your request. As the amount of data being processed with SparkSQL grows, tweaking sql.shuffle.partitions becomes a common need to prevent OOM and performance degradation. The fact that sql.shuffle.partitions cannot be set several times in the same job/action, bec