Jeffrey Picard <jp3...@columbia.edu> writes:
> As the program runs I'm seeing each iteration take longer and longer to
> complete, which seems counterintuitive to me, especially since I am seeing
> the shuffle read/write amounts decrease with each iteration. I would think
> that as more and more vertices converge the iterations should take a shorter
> amount of time. I can run on up to 150 of the 500 part files (stored on s3)
> and it finishes in about 12 minutes, but with all the data I've let it run up
> to 4 hours and it still doesn't complete.
If GraphX is running close to the cluster's memory capacity, one possibility is that Spark is dropping parts of the graph from memory and recomputing them on each iteration. The Spark web UI will show whether this is happening: the Executors page will show executors near their memory limits, and the Storage page will show RDDs with less than 100% of their blocks cached. In that case you can allow Spark to spill partitions to disk by setting the graph's storage level to MEMORY_AND_DISK or DISK_ONLY when you load the graph.
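For example, if you're loading the edge list with GraphLoader.edgeListFile (Spark 1.1 or later), you can pass the storage levels at load time. This is just a rough sketch; the S3 path and partition count below are placeholders, not taken from your setup:

    import org.apache.spark.SparkContext
    import org.apache.spark.graphx.GraphLoader
    import org.apache.spark.storage.StorageLevel

    object LoadGraphWithSpill {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext()  // configured via spark-submit

        // Allow both the edge and vertex RDDs to spill to disk instead of
        // being dropped and recomputed when executors run low on memory.
        val graph = GraphLoader.edgeListFile(
          sc,
          "s3n://my-bucket/edges/part-*",   // placeholder path
          numEdgePartitions = 500,          // placeholder partition count
          edgeStorageLevel = StorageLevel.MEMORY_AND_DISK,
          vertexStorageLevel = StorageLevel.MEMORY_AND_DISK)

        println(s"Loaded ${graph.vertices.count()} vertices")
        sc.stop()
      }
    }

If you're building the graph some other way (e.g. with the Graph() constructors or Graph.fromEdges), recent versions take similar edgeStorageLevel/vertexStorageLevel parameters, so the same idea applies; the key point is to set the level before the graph is first cached rather than afterwards.

Ankur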