I'm implementing one of my machine learning/graph analysis algorithms on Apache Spark.

The algorithm is highly iterative (like most ML algorithms), but it has a rather unusual workflow. First, a subset of the training data (the seeds RDD {S_1}) is randomly selected and loaded. In each iteration, the seeds {S_n} update themselves into {S_{n+1}} and yield a model RDD {M_n}. Once the seeds meet a stopping condition, the iteration ends and all the model RDDs are aggregated to produce the final result. As in other iterative implementations in MLlib, both {S_n} and {M_n} have to be checkpointed regularly; this seems more efficient than merging each {M_n} into a single growing RDD and caching/checkpointing that, since data already on HDFS doesn't have to be written to disk again or take up memory until the final stage. (A simplified sketch of the driver loop is appended at the end of this message.)

However, right before the final step where all {M_*} are aggregated, Spark seems to freeze: all stages/jobs are completed, no new stage/job is pending, and the driver and executors are all running but doing nothing (while the algorithm is still far from complete). I have to wait 10+ hours before it starts to proceed, so the between-stage latency on the UI looks really strange (note the sharp contrast between the ~15 s task running time and the 10+ hour between-stage gap):

<http://apache-spark-user-list.1001560.n3.nabble.com/file/n22925/zQxJQ.png>

Is my implementation of the algorithm simply not well optimized for Spark, or have I run into a hidden issue? Thanks a lot for your opinions.
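For reference, here is a heavily simplified sketch of the driver loop. Names such as updateSeeds, buildModel, and converged are placeholders for my actual logic, and the checkpoint directory, interval, and seed data are only illustrative:

import scala.collection.mutable.ArrayBuffer

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object IterativeWorkflow {

  // Placeholders standing in for the real per-iteration logic.
  def updateSeeds(seeds: RDD[Long]): RDD[Long] =
    seeds.map(_ + 1)                                // S_n -> S_{n+1}

  def buildModel(seeds: RDD[Long]): RDD[(Long, Double)] =
    seeds.map(s => (s, 1.0))                        // yields M_n

  def converged(seeds: RDD[Long], iteration: Int): Boolean =
    iteration >= 100                                // placeholder stopping condition

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("iterative-workflow"))
    sc.setCheckpointDir("hdfs:///tmp/checkpoints")  // illustrative path

    // S_1: in the real job this is a random subset of the training data.
    var seeds: RDD[Long] = sc.parallelize(0L until 1000L)
    val models = ArrayBuffer.empty[RDD[(Long, Double)]]
    var iteration = 0

    while (!converged(seeds, iteration)) {
      seeds = updateSeeds(seeds)
      val model = buildModel(seeds)
      models += model

      // Checkpoint both S_n and M_n periodically to truncate the lineage,
      // instead of folding every M_n into one growing, cached RDD.
      if (iteration % 10 == 0) {
        seeds.persist()
        seeds.checkpoint()
        seeds.count()       // action to materialize the checkpoint
        model.persist()
        model.checkpoint()
        model.count()
      }
      iteration += 1
    }

    // Final step: union all M_* and aggregate -- this is where the long
    // between-stage pause appears.
    val result = sc.union(models).reduceByKey(_ + _)
    result.count()
  }
}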
The algorithm is very iterative (like all other ML algorithms), but it has a rather strange workflow: first a subset of all training data (called seeds RDD: {S_1} is randomly selected) and loaded, in each iteration, the seeds {S_n} will update itself {S_n+1} and yield a model RDD: {M_n}. After the seeds have reached a condition the iteration will stop and all model RDD are aggregated to yield the final result. Like all iterative implementation in MLLib, both {S_} and {M_} has to be checkpointed regularly (which seems to be more efficient than commiting {M} into a growing RDD and cache/checkpoint it: old data already on HDFS don't have to be written into disk again or take memory space until the final stage). However, before the final step when all {M_*} are aggregated. The spark seems to get frozen: all stages/jobs are completed, no new stage/job are pending, and all drivers and clusters are running but doing nothing (the algorithm is still far from completion). I have to wait for 10+ hours before it start to proceed. So the latency between stages on UI looks really weird (see the sharp contrast between 15s task running time and 10h+ between-stage latency?): <http://apache-spark-user-list.1001560.n3.nabble.com/file/n22925/zQxJQ.png> I wonder if my implementation for algorithm is not optimized for Spark? Or I simply encounter a hidden issue? Thanks a lot for your opinion -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Union-of-checkpointed-RDD-in-Apache-Spark-has-long-10-hour-between-stage-latency-tp22925.html Sent from the Apache Spark User List mailing list archive at Nabble.com. --------------------------------------------------------------------- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org