I wrote this piece based on all that; hopefully it will help: http://jgp.net/2017/02/02/what-are-spark-checkpoints-on-dataframes/
> On Jan 31, 2017, at 4:18 PM, Burak Yavuz <brk...@gmail.com> wrote:
>
> Hi Koert,
>
> When eager is true, we return you a new DataFrame that depends on the files
> written out to the checkpoint directory.
> All previous operations on the checkpointed DataFrame are gone forever. You
> basically start fresh. AFAIK, when eager is true, the method will not return
> until the DataFrame is completely checkpointed. If you look at the
> RDD.checkpoint implementation, the checkpoint location is updated
> synchronously, so during the count, `isCheckpointed` will be true.
>
> Best,
> Burak
>
> On Tue, Jan 31, 2017 at 12:52 PM, Koert Kuipers <ko...@tresata.com> wrote:
> i understand that checkpoint cuts the lineage, but i am not fully sure i
> understand the role of eager.
>
> eager simply seems to materialize the rdd early with a count, right after the
> rdd has been checkpointed. but why is that useful? rdd.checkpoint is
> asynchronous, so when the rdd.count happens most likely rdd.isCheckpointed
> will be false, and the count will be on the rdd before it was checkpointed.
> what is the benefit of that?
>
>
> On Thu, Jan 26, 2017 at 11:19 PM, Burak Yavuz <brk...@gmail.com> wrote:
> Hi,
>
> One of the goals of checkpointing is to cut the RDD lineage. Otherwise you
> can run into a StackOverflowError. If you eagerly checkpoint, you basically
> cut the lineage there, and the next operations all depend on the checkpointed
> DataFrame. If you don't checkpoint, you continue to build the lineage, and
> while that lineage is being resolved, you may hit the StackOverflowError.
>
> HTH,
> Burak
>
> On Thu, Jan 26, 2017 at 10:36 AM, Jean Georges Perrin <j...@jgp.net> wrote:
> Hey Sparkers,
>
> Trying to understand the Dataframe's checkpoint (not in the context of
> streaming)
> https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Dataset.html#checkpoint(boolean)
>
> What is the goal of the eager flag?
>
> Thanks!
>
> jg
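
To make the lineage argument in the thread concrete, here is a toy model in plain Python. It is not Spark code and makes no claim about Spark's internals: `ToyDataset`, `iterate`, and `try_collect` are hypothetical names invented for this sketch, and Python's `RecursionError` stands in for the JVM's StackOverflowError. The assumption being illustrated is only the one discussed above: an eager checkpoint materializes the data and cuts the lineage right away, while a non-eager one merely marks the dataset, so the lineage keeps growing until some action runs.

```python
import sys


class ToyDataset:
    """Toy lazily evaluated dataset. A teaching model, not Spark's API."""

    def __init__(self, rows=None, parent=None, fn=None):
        self.rows = rows        # materialized rows (source data or checkpoint)
        self.parent = parent    # previous step in the lineage, if any
        self.fn = fn            # per-row transformation applied to the parent
        self.checkpoint_pending = False

    def map(self, fn):
        # Transformations are lazy: they only extend the lineage.
        return ToyDataset(parent=self, fn=fn)

    def lineage_depth(self):
        depth, node = 0, self
        while node.parent is not None:
            depth, node = depth + 1, node.parent
        return depth

    def collect(self):
        # Action: resolve the lineage recursively, one stack frame per step.
        if self.rows is not None:
            return list(self.rows)
        rows = [self.fn(r) for r in self.parent.collect()]
        if self.checkpoint_pending:
            # Non-eager checkpoint: saved as a side effect of the first action.
            self.rows, self.parent, self.fn = rows, None, None
            self.checkpoint_pending = False
        return rows

    def checkpoint(self, eager=True):
        if eager:
            # Materialize now; the returned dataset has no lineage behind it.
            return ToyDataset(rows=self.collect())
        # Only mark the dataset; the lineage stays until an action runs.
        self.checkpoint_pending = True
        return self


def iterate(steps, checkpoint_every=None, eager=True):
    """Apply `steps` transformations, optionally checkpointing along the way."""
    ds = ToyDataset(rows=[1, 2, 3])
    for i in range(1, steps + 1):
        ds = ds.map(lambda r: r + 1)
        if checkpoint_every and i % checkpoint_every == 0:
            ds = ds.checkpoint(eager=eager)
    return ds


def try_collect(ds):
    """Run the action, reporting a blown stack instead of crashing."""
    try:
        return ds.collect()
    except RecursionError:
        return "RecursionError"


# Deep enough that resolving the whole lineage in one go exceeds the stack.
STEPS = sys.getrecursionlimit() * 5

no_checkpoint = iterate(STEPS)
lazy = iterate(STEPS, checkpoint_every=100, eager=False)
eager = iterate(STEPS, checkpoint_every=100, eager=True)

for name, ds in [("none", no_checkpoint), ("lazy", lazy), ("eager", eager)]:
    print(name, "depth:", ds.lineage_depth(), "result:", try_collect(ds))
```

In this toy, the run without checkpoints and the run with non-eager checkpoints both keep the full lineage and fail when the final action tries to resolve it, because no action ever ran in between to write the marked checkpoints. Only the eager run keeps the lineage short. Real Spark jobs differ in many ways (a non-eager checkpoint followed by an action also truncates the lineage), so treat this purely as an illustration of why the flag exists.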