I wrote this piece based on all of that; hopefully it will help:
http://jgp.net/2017/02/02/what-are-spark-checkpoints-on-dataframes/ 
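
For anyone skimming the thread below, here is a toy sketch in plain Python of why cutting the lineage matters and what the eager flag changes. It is not Spark code and not how Spark implements checkpointing: the `Node` class, the `iterate` helper, and the step counts are all made up for illustration, and Python's recursion limit stands in for the JVM stack.

```python
class Node:
    """Toy lazily evaluated dataset; NOT Spark, just a parent chain of functions."""

    def __init__(self, value=None, parent=None, fn=None):
        self.value = value      # only meaningful once materialized
        self.parent = parent    # previous step in the lineage, or None
        self.fn = fn            # how to derive this node from its parent
        self.marked = False     # True once checkpoint() was requested

    def map(self, fn):
        return Node(parent=self, fn=fn)

    def depth(self):
        # Walk the chain iteratively so measuring it can never overflow.
        d, node = 0, self
        while node.parent is not None:
            d, node = d + 1, node.parent
        return d

    def collect(self):
        # The "action": resolves the lineage recursively, one frame per step.
        if self.parent is None:
            return self.value
        result = self.fn(self.parent.collect())
        if self.marked:
            # Materialize and cut the lineage, as a checkpoint would.
            self.value, self.parent, self.fn = result, None, None
        return result

    def checkpoint(self, eager=True):
        cp = Node(parent=self, fn=lambda v: v)
        cp.marked = True
        if eager:
            cp.collect()  # materialize now, so cp starts with no lineage
        return cp


def iterate(steps, every, eager):
    """Apply `steps` increments, checkpointing after every `every` steps."""
    node = Node(value=0)
    for i in range(1, steps + 1):
        node = node.map(lambda v: v + 1)
        if i % every == 0:
            node = node.checkpoint(eager=eager)
    return node


eager_node = iterate(steps=5000, every=100, eager=True)
lazy_node = iterate(steps=5000, every=100, eager=False)

eager_depth = eager_node.depth()   # lineage was cut at every checkpoint
lazy_depth = lazy_node.depth()     # nothing materialized yet, chain kept growing
print(eager_depth, eager_node.collect())
print(lazy_depth)

try:
    lazy_node.collect()
except RecursionError:
    print("lazy chain was too deep to resolve in one go")
```

In this toy, the eager variant never has to resolve more than about 100 steps at a time, while the lazy variant only marks nodes and leaves the whole chain to be resolved by the first action.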

> On Jan 31, 2017, at 4:18 PM, Burak Yavuz <brk...@gmail.com> wrote:
> 
> Hi Koert,
> 
> When eager is true, we return you a new DataFrame that depends on the files 
> written out to the checkpoint directory.
> All previous operations on the checkpointed DataFrame are gone forever. You 
> basically start fresh. AFAIK, when eager is true, the method will not return 
> until the DataFrame is completely checkpointed. If you look at the 
> RDD.checkpoint implementation, the checkpoint location is updated 
> synchronously, so during the count `isCheckpointed` will be true.
> 
> Best,
> Burak
> 
> On Tue, Jan 31, 2017 at 12:52 PM, Koert Kuipers <ko...@tresata.com> wrote:
> i understand that checkpoint cuts the lineage, but i am not fully sure i 
> understand the role of eager. 
> 
> eager simply seems to materialize the rdd early with a count, right after the 
> rdd has been checkpointed. but why is that useful? rdd.checkpoint is 
> asynchronous, so when the rdd.count happens most likely rdd.isCheckpointed 
> will be false, and the count will be on the rdd before it was checkpointed. 
> what is the benefit of that?
> 
> 
> On Thu, Jan 26, 2017 at 11:19 PM, Burak Yavuz <brk...@gmail.com> wrote:
> Hi,
> 
> One of the goals of checkpointing is to cut the RDD lineage. Otherwise you 
> run into a StackOverflowError. If you eagerly checkpoint, you basically 
> cut the lineage there, and the next operations all depend on the checkpointed 
> DataFrame. If you don't checkpoint, you continue to build the lineage, 
> so while that lineage is being resolved, you may hit the 
> StackOverflowError.
> 
> HTH,
> Burak
> 
> On Thu, Jan 26, 2017 at 10:36 AM, Jean Georges Perrin <j...@jgp.net> wrote:
> Hey Sparkers,
> 
> I'm trying to understand the DataFrame's checkpoint (not in the context of 
> streaming):
> https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Dataset.html#checkpoint(boolean)
> 
> What is the goal of the eager flag?
> 
> Thanks!
> 
> jg
> 
> 
> 
