Hi,

I have a single-action job (saveAsObjectFile at the end) that includes several
stages. One of those stages is an expensive join, "rdd1.join(rdd2)". I would
like to checkpoint rdd1 right before the join to improve the stability of
the job. However, what I'm seeing is that the job executes all the way
to the end (saveAsObjectFile) without doing any checkpointing, and only then
re-runs the computation to checkpoint rdd1 (that is when I see the files
appear in the checkpoint directory). I have no issue with the recomputation
itself, given that I'm not caching rdd1, but checkpointing rdd1 only after
the join brings no benefit: the whole DAG is still executed in one piece,
and the job fails. If that is actually what is happening, what would be the
best approach to solve it?
What I'm currently doing is manually saving rdd1 to HDFS right after the
filter in line (4) and then loading it back right before the join in line
(11); a sketch of that workaround follows the snippet below. This keeps the
job from failing by splitting it into two jobs (i.e. two actions). My
expectation was that rdd1.checkpoint() in line (8) would have the same
effect, but without the hassle of manually saving and loading intermediate
files.

///////////////////////////////////////////////

(1)   val rdd1 = loadData1
(2)     .map(...)
(3)     .groupByKey()
(4)     .filter(...)
(5)
(6)   val rdd2 = loadData2
(7)
(8)   rdd1.checkpoint()   // sc.setCheckpointDir(...) is set beforehand
(9)
(10)  rdd1
(11)    .join(rdd2)
(12)    .saveAsObjectFile(...)

/////////////////////////////////////////////
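
For reference, the manual workaround currently in place looks roughly like
this (the path, key/value types, and sample data are illustrative stand-ins
for the real pipeline):

///////////////////////////////////////////////

import org.apache.spark.{SparkConf, SparkContext}

object ManualCheckpointWorkaround {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("manual-checkpoint-workaround"))
    val tmpPath = "hdfs:///tmp/rdd1-intermediate" // illustrative intermediate path

    // Stand-ins for loadData1/loadData2; the real rdd1 is built via
    // map/groupByKey/filter as in the snippet above.
    val rdd1 = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
      .groupByKey()
      .filter { case (_, vs) => vs.nonEmpty }
    val rdd2 = sc.parallelize(Seq(("a", "x"), ("b", "y")))

    // Job 1: materialize rdd1 to HDFS with its own action.
    rdd1.saveAsObjectFile(tmpPath)

    // Job 2: reload the materialized data and run the join from there, so a
    // failure in the join no longer re-runs rdd1's whole lineage.
    val rdd1Reloaded = sc.objectFile[(String, Iterable[Int])](tmpPath)
    rdd1Reloaded
      .join(rdd2)
      .saveAsObjectFile("hdfs:///tmp/join-output")
  }
}

///////////////////////////////////////////////

The two saveAsObjectFile calls are what split the work into two jobs.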

Thanks in advance,
Leo


