I'm implementing one of my machine learning/graph analysis algorithms on Apache Spark.

The algorithm is highly iterative (like most ML algorithms), but it has a rather unusual workflow. First, a subset of the training data (the seeds RDD {S_1}) is randomly selected and loaded. In each iteration, the seeds {S_n} update themselves into {S_{n+1}} and yield a model RDD {M_n}. Once the seeds meet a stopping condition, the iteration ends and all the model RDDs are aggregated to produce the final result. As in other iterative implementations in MLlib, both {S_n} and {M_n} have to be checkpointed regularly; this seems more efficient than merging each {M_n} into a single growing RDD and caching/checkpointing that, since data already on HDFS doesn't have to be written to disk again or take up memory until the final stage. (A simplified sketch of the driver loop is appended at the end of this message.)

However, right before the final step where all {M_*} are aggregated, Spark seems to freeze: all stages/jobs are completed, no new stage/job is pending, and the driver and executors are all running but doing nothing (while the algorithm is still far from complete). I have to wait 10+ hours before it starts to proceed, so the between-stage latency on the UI looks really strange (note the sharp contrast between the ~15 s task running time and the 10+ hour between-stage gap):

<http://apache-spark-user-list.1001560.n3.nabble.com/file/n22925/zQxJQ.png>

Is my implementation of the algorithm simply not well optimized for Spark, or have I run into a hidden issue? Thanks a lot for your opinions.
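For reference, here is a heavily simplified sketch of the driver loop. Names such as updateSeeds, buildModel, and converged are placeholders for my actual logic, and the checkpoint directory, interval, and seed data are only illustrative:

import scala.collection.mutable.ArrayBuffer

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object IterativeWorkflow {

  // Placeholders standing in for the real per-iteration logic.
  def updateSeeds(seeds: RDD[Long]): RDD[Long] =
    seeds.map(_ + 1)                                // S_n -> S_{n+1}

  def buildModel(seeds: RDD[Long]): RDD[(Long, Double)] =
    seeds.map(s => (s, 1.0))                        // yields M_n

  def converged(seeds: RDD[Long], iteration: Int): Boolean =
    iteration >= 100                                // placeholder stopping condition

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("iterative-workflow"))
    sc.setCheckpointDir("hdfs:///tmp/checkpoints")  // illustrative path

    // S_1: in the real job this is a random subset of the training data.
    var seeds: RDD[Long] = sc.parallelize(0L until 1000L)
    val models = ArrayBuffer.empty[RDD[(Long, Double)]]
    var iteration = 0

    while (!converged(seeds, iteration)) {
      seeds = updateSeeds(seeds)
      val model = buildModel(seeds)
      models += model

      // Checkpoint both S_n and M_n periodically to truncate the lineage,
      // instead of folding every M_n into one growing, cached RDD.
      if (iteration % 10 == 0) {
        seeds.persist()
        seeds.checkpoint()
        seeds.count()       // action to materialize the checkpoint
        model.persist()
        model.checkpoint()
        model.count()
      }
      iteration += 1
    }

    // Final step: union all M_* and aggregate -- this is where the long
    // between-stage pause appears.
    val result = sc.union(models).reduceByKey(_ + _)
    result.count()
  }
}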
The algorithm is very iterative (like all other ML algorithms), but it has a rather strange workflow: first a subset of all training data (called seeds RDD: {S_1} is randomly selected) and loaded, in each iteration, the seeds {S_n} will update itself {S_n+1} and yield a model RDD: {M_n}. After the seeds have reached a condition the iteration will stop and all model RDD are aggregated to yield the final result. Like all iterative implementation in MLLib, both {S_} and {M_} has to be checkpointed regularly (which seems to be more efficient than commiting {M} into a growing RDD and cache/checkpoint it: old data already on HDFS don't have to be written into disk again or take memory space until the final stage). However, before the final step when all {M_*} are aggregated. The spark seems to get frozen: all stages/jobs are completed, no new stage/job are pending, and all drivers and clusters are running but doing nothing (the algorithm is still far from completion). I have to wait for 10+ hours before it start to proceed. So the latency between stages on UI looks really weird (see the sharp contrast between 15s task running time and 10h+ between-stage latency?): <http://apache-spark-user-list.1001560.n3.nabble.com/file/n22925/zQxJQ.png> I wonder if my implementation for algorithm is not optimized for Spark? Or I simply encounter a hidden issue? Thanks a lot for your opinion -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Union-of-checkpointed-RDD-in-Apache-Spark-has-long-10-hour-between-stage-latency-tp22925.html Sent from the Apache Spark User List mailing list archive at Nabble.com. --------------------------------------------------------------------- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org