Hi,

> My understanding is that .localCheckpoint() breaks the lineage of the RDD

True.

> and this requires the entire RDD to be rebuilt instead of being able
> to recompute lost partitions.

In a sense, it's as if you saved the partitions to executors and re-read them back as source data (for this checkpointed RDD). If an executor dies, the partitions it stored are gone and, with the lineage cut, Spark cannot recompute them.

> Does each executor store a copy of the entire RDD?

No. An executor holds only the data of the partitions computed by the tasks it executed.

> Checkpoint over .localCheckpoint.

checkpoint is similar to localCheckpoint, but slower and reliable, as the data is written to a stable file system such as HDFS rather than kept on an ephemeral executor. In either case, the effect on the lineage is the same: it is cut.

Regards,
Jacek Laskowski
----
https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski

On Wed, Jan 6, 2021 at 6:15 PM brettplarson <brettpatricklar...@gmail.com> wrote:

> Hello,
> I am wondering what the impact of using .localCheckpoint() and having the
> executor die would be?
>
> My understanding is that .localCheckpoint() breaks the lineage of the RDD
> and this requires the entire RDD to be rebuilt instead of being able
> to recompute lost partitions.
>
> Does each executor store a copy of the entire RDD?
>
> It's unclear to me the benefit of using Checkpoint over .localCheckpoint.
> (I am aware that this is HDFS backed, but it's unclear the implications of
> this)
>
> Please let me know,
> Thank you!
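
P.S. To make the executor-death scenario from the question concrete, here is a toy model in plain Python. It is not Spark code and not how Spark is implemented: `ToyCluster`, `local_checkpoint`, `reliable_checkpoint` and the other names are all made up for illustration. It only mimics the storage semantics discussed in this thread: a local checkpoint keeps each partition on the executor that computed it, while a reliable checkpoint writes every partition to shared stable storage.

```python
from dataclasses import dataclass, field


class LostPartitionError(RuntimeError):
    """A partition is gone and there is no lineage left to rebuild it."""


@dataclass
class ToyCluster:
    # executor id -> {partition index -> data}; disappears with the executor
    executors: dict = field(default_factory=dict)
    # stand-in for a stable file system such as HDFS: partition index -> data
    stable_storage: dict = field(default_factory=dict)

    def kill(self, executor_id):
        """Simulate an executor dying: everything it stored is lost."""
        self.executors.pop(executor_id, None)


def local_checkpoint(cluster, partitions, placement):
    """Keep each partition only on the executor that computed it."""
    for index, data in enumerate(partitions):
        cluster.executors.setdefault(placement[index], {})[index] = data


def reliable_checkpoint(cluster, partitions):
    """Write every partition to shared stable storage."""
    for index, data in enumerate(partitions):
        cluster.stable_storage[index] = data


def read_local(cluster, index):
    """Read a locally checkpointed partition; fail if its executor is gone."""
    for blocks in cluster.executors.values():
        if index in blocks:
            return blocks[index]
    # Lineage was cut by the checkpoint, so recomputation is not an option.
    raise LostPartitionError(f"partition {index} is lost")


def read_reliable(cluster, index):
    """Read a reliably checkpointed partition from stable storage."""
    return cluster.stable_storage[index]


cluster = ToyCluster()
partitions = [[1, 2], [3, 4], [5, 6]]
placement = {0: "exec-1", 1: "exec-2", 2: "exec-1"}  # partition -> executor

local_checkpoint(cluster, partitions, placement)
reliable_checkpoint(cluster, partitions)

# Each executor holds only its own partitions, never the whole dataset.
held_by_exec_2 = sorted(cluster.executors["exec-2"])

cluster.kill("exec-1")

# Partitions on the surviving executor are still readable...
surviving = read_local(cluster, 1)

# ...but the ones that lived on exec-1 cannot be recovered.
try:
    read_local(cluster, 0)
    local_recovered = True
except LostPartitionError:
    local_recovered = False

# The reliable copy is unaffected by the executor's death.
reliable_copy = [read_reliable(cluster, i) for i in range(len(partitions))]
```

In real Spark the corresponding calls are `RDD.localCheckpoint()` and `RDD.checkpoint()`, the latter after setting a directory with `SparkContext.setCheckpointDir`.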