Hello,

In a thread about "java.lang.StackOverflowError when calling count()" [1] I
saw Tathagata Das share an interesting approach for truncating RDD lineage:
it helps prevent StackOverflowError in high-iteration jobs while avoiding
the disk-writing performance penalty of checkpointing. Here's an excerpt
from TD's post:

If you are brave enough, you can try the following. Instead of relying on
checkpointing to HDFS to truncate lineage, you can do this:
1. Persist the Nth RDD with replication (see the different StorageLevels);
this replicates the in-memory RDD between workers within Spark. Let's call
this RDD R.
2. Force it to materialize in memory.
3. Create a modified RDD R` which has the same data as R but no lineage.
This is done by creating a new BlockRDD using the ids of the blocks of data
representing the in-memory R (I can elaborate on that if you want).

This avoids writing to HDFS (the replication stays in Spark memory),
truncates the lineage (by creating new BlockRDDs), and avoids the stack
overflow error.

---------------------------------------------------------------------

Now I'm not sure how to do step 3. Any ideas? I'm CC'ing Tathagata too.
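For context, here's roughly how far I've gotten. Steps 1 and 2 are
straightforward; for step 3 the only route I can see is the BlockRDD
constructor, but BlockRDD and RDDBlockId are Spark-internal (private[spark]
in some builds, so this may need to live under an org.apache.spark package),
and I'm not certain this is what TD had in mind:

    import scala.reflect.ClassTag
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.{RDD, BlockRDD}
    import org.apache.spark.storage.{BlockId, RDDBlockId, StorageLevel}

    // Sketch: rebuild an RDD from its in-memory blocks to drop its lineage.
    def truncateLineage[T: ClassTag](sc: SparkContext, rdd: RDD[T]): RDD[T] = {
      // 1. Persist with in-memory replication (2 copies) across workers.
      rdd.persist(StorageLevel.MEMORY_ONLY_2)
      // 2. Force materialization so every partition is actually in memory.
      rdd.count()
      // 3. Point a fresh BlockRDD at the blocks backing the persisted RDD.
      //    RDDBlockId(rddId, partitionIndex) is my guess at how the block
      //    manager names those blocks.
      val blockIds: Array[BlockId] =
        rdd.partitions.indices.map(i => RDDBlockId(rdd.id, i): BlockId).toArray
      new BlockRDD[T](sc, blockIds)
    }

Presumably the replicated copies are what protect you if a worker holding a
block dies, since the new RDD has no lineage to recompute from.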

Cheers,
Nilesh

[1]:
http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201405.mbox/%3ccamwrk0kiqxhktfuaamhborov5lv+d8y+c5nycmsxtqasze4...@mail.gmail.com%3E



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Alternative-to-checkpointing-and-materialization-for-truncating-lineage-in-high-iteration-jobs-tp8488.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.