I don't think it is a deliberate design, so you may need to run an action on the <data> RDD before the action on the <temp> RDD if you want to explicitly checkpoint the <data> RDD.
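From the doCheckpoint() code you pasted, the recursion into parent RDDs only happens when the current RDD has no checkpointData of its own, so an action on <temp> checkpoints only <temp>. A minimal sketch of the workaround in the shell (the checkpoint directory is just the one from your example):

scala> sc.setCheckpointDir("/tmp/checkpoint")

scala> val data = sc.parallelize(List("a", "b", "c"))
scala> data.checkpoint

scala> data.count    // action on data, so data is checkpointed here

scala> val temp = data.map(item => (item, 1))
scala> temp.checkpoint

scala> temp.count    // action on temp, so temp is checkpointed here

After both actions, /tmp/checkpoint should contain checkpoint files for both RDDs.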
2015-11-26 13:23 GMT+08:00 wyphao.2007 <wyphao.2...@163.com>:

> Spark 1.5.2.
>
> On 2015-11-26 13:19:39, "张志强(旺轩)" <zzq98...@alibaba-inc.com> wrote:
>
> What’s your Spark version?
>
> *From:* wyphao.2007 [mailto:wyphao.2...@163.com]
> *Sent:* 2015-11-26 10:04
> *To:* user
> *Cc:* dev@spark.apache.org
> *Subject:* Spark checkpoint problem
>
> I am testing checkpoint to understand how it works. My code is as follows:
>
> scala> val data = sc.parallelize(List("a", "b", "c"))
> data: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at
> parallelize at <console>:15
>
> scala> sc.setCheckpointDir("/tmp/checkpoint")
> 15/11/25 18:09:07 WARN spark.SparkContext: Checkpoint directory must be
> non-local if Spark is running on a cluster: /tmp/checkpoint1
>
> scala> data.checkpoint
>
> scala> val temp = data.map(item => (item, 1))
> temp: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[3] at map
> at <console>:17
>
> scala> temp.checkpoint
>
> scala> temp.count
>
> but I found that only the temp RDD is checkpointed in the /tmp/checkpoint
> directory; the data RDD is not checkpointed! I found the doCheckpoint
> function in the org.apache.spark.rdd.RDD class:
>
> private[spark] def doCheckpoint(): Unit = {
>   RDDOperationScope.withScope(sc, "checkpoint", allowNesting = false, ignoreParent = true) {
>     if (!doCheckpointCalled) {
>       doCheckpointCalled = true
>       if (checkpointData.isDefined) {
>         checkpointData.get.checkpoint()
>       } else {
>         dependencies.foreach(_.rdd.doCheckpoint())
>       }
>     }
>   }
> }
>
> From the code above, only the last RDD (in my case, temp) will be
> checkpointed. My question: is this deliberately designed, or is it a bug?
>
> Thank you.

--
王海华