I don't think it is a deliberate design, so you may need to run an action on the <data> RDD before the action on the <temp> RDD if you want to explicitly checkpoint the <data> RDD.
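From the doCheckpoint() code you pasted, the recursion into parent RDDs only happens when the current RDD has no checkpointData of its own, so an action on <temp> checkpoints only <temp>. A minimal sketch of the workaround in the shell (the checkpoint directory is just the one from your example):

scala> sc.setCheckpointDir("/tmp/checkpoint")

scala> val data = sc.parallelize(List("a", "b", "c"))
scala> data.checkpoint

scala> data.count    // action on data, so data is checkpointed here

scala> val temp = data.map(item => (item, 1))
scala> temp.checkpoint

scala> temp.count    // action on temp, so temp is checkpointed here

After both actions, /tmp/checkpoint should contain checkpoint files for both RDDs.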
2015-11-26 13:23 GMT+08:00 wyphao.2007 <wyphao.2...@163.com>:

> Spark 1.5.2.
>
> On 2015-11-26 13:19:39, "张志强(旺轩)" <zzq98...@alibaba-inc.com> wrote:
>
> What’s your Spark version?
>
> *From:* wyphao.2007 [mailto:wyphao.2...@163.com]
> *Sent:* 2015-11-26 10:04
> *To:* user
> *Cc:* dev@spark.apache.org
> *Subject:* Spark checkpoint problem
>
> I am testing checkpoint to understand how it works. My code is as follows:
>
> scala> val data = sc.parallelize(List("a", "b", "c"))
> data: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at
> parallelize at <console>:15
>
> scala> sc.setCheckpointDir("/tmp/checkpoint")
> 15/11/25 18:09:07 WARN spark.SparkContext: Checkpoint directory must be
> non-local if Spark is running on a cluster: /tmp/checkpoint1
>
> scala> data.checkpoint
>
> scala> val temp = data.map(item => (item, 1))
> temp: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[3] at map
> at <console>:17
>
> scala> temp.checkpoint
>
> scala> temp.count
>
> but I found that only the temp RDD is checkpointed in the /tmp/checkpoint
> directory; the data RDD is not checkpointed! I found the doCheckpoint
> function in the org.apache.spark.rdd.RDD class:
>
> private[spark] def doCheckpoint(): Unit = {
>   RDDOperationScope.withScope(sc, "checkpoint", allowNesting = false, ignoreParent = true) {
>     if (!doCheckpointCalled) {
>       doCheckpointCalled = true
>       if (checkpointData.isDefined) {
>         checkpointData.get.checkpoint()
>       } else {
>         dependencies.foreach(_.rdd.doCheckpoint())
>       }
>     }
>   }
> }
>
> From the code above, only the last RDD (in my case, temp) will be
> checkpointed. My question: is this deliberately designed, or is it a bug?
>
> Thank you.

--
王海华