On Jul 30, 2014, at 5:18 AM, Ankur Dave <ankurd...@gmail.com> wrote:
> Jeffrey Picard <jp3...@columbia.edu> writes:
>> As the program runs I’m seeing each iteration take longer and longer to
>> complete, this seems counter intuitive to me, especially since I am seeing
>> the shuffle read/write amounts decrease with each iteration. I would think
>> that as more and more vertices converged the iterations should take a
>> shorter amount of time. I can run on up to 150 of the 500 part files (stored
>> on s3) and it finishes in about 12 minutes, but with all the data I’ve let
>> it run up to 4 hours and it still doesn’t complete.
>
> If GraphX is running close to the cluster's memory capacity, one possibility
> is that Spark is dropping part of the graph from memory and causing
> recomputation. The Spark web UI will show if this is the case: the Executors
> page will show executors close to their memory limit, and the storage page
> will show many RDDs with less than 100% cached blocks.
>
> In that case you could allow Spark to spill partitions to disk by changing
> the graph's storage level to MEMORY_AND_DISK or DISK_ONLY when you load the
> graph.
>
> Ankur

Thanks Ankur, my problem does sound as you described, so I think that’s probably it.

It seems that the version of graphx I’m using doesn't have the option for setting the storage level in the GraphLoader.edgeListFile method.

https://spark.apache.org/docs/1.0.1/api/scala/index.html#org.apache.spark.graphx.GraphLoader$

I tried unpersisting the edges and vertices of the graph by hand, then persisting the graph with persist(StorageLevel.MEMORY_AND_DISK). I still see the same behavior in connected components however, and the same thing you described in the storage page.
Storage

RDD Name   Storage Level                      Cached Partitions  Fraction Cached  Size in Memory  Size in Tachyon  Size on Disk
VertexRDD  Memory Deserialized 1x Replicated  278                56%              50.6 GB         0.0 B            0.0 B
VertexRDD  Disk Serialized 1x Replicated      498                100%             0.0 B           0.0 B            32.4 GB
VertexRDD  Memory Deserialized 1x Replicated  435                87%              79.2 GB         0.0 B            0.0 B
EdgeRDD    Memory Deserialized 1x Replicated  492                98%              273.5 GB        0.0 B            0.0 B
VertexRDD  Memory Deserialized 1x Replicated  395                79%              71.5 GB         0.0 B            0.0 B
EdgeRDD    Memory Deserialized 1x Replicated  263                53%              146.2 GB        0.0 B            0.0 B
VertexRDD  Memory Deserialized 1x Replicated  400                80%              72.8 GB         0.0 B            0.0 B
VertexRDD  Memory Deserialized 1x Replicated  179                36%              32.4 GB         0.0 B            0.0 B
EdgeRDD    Disk Serialized 1x Replicated      500                100%             0.0 B           0.0 B            96.0 GB

Would that (newer?) version of GraphX with the storage level settable in the edgeListFile possibly solve this, or could there still be something else going on?
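
For reference, here is a rough sketch of what loading the graph with a spillable storage level might look like. It assumes a GraphX release whose GraphLoader.edgeListFile accepts the edgeStorageLevel and vertexStorageLevel parameters (Spark 1.1.0 and later, as far as I know; they are not in 1.0.1). The object name, app name and input path argument are placeholders, not anything from the job above, and I can't say from this thread alone whether this change by itself removes the slowdown.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.GraphLoader
import org.apache.spark.storage.StorageLevel

// Hypothetical driver; the name and arguments are illustrative only.
object ConnectedComponentsSpill {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cc-spill"))

    // Placeholder: directory of edge-list part files, e.g. an s3 path.
    val path = args(0)

    // Assumes Spark >= 1.1.0: ask GraphX to keep edges and vertices in
    // memory when they fit and spill them to disk otherwise, instead of
    // dropping partitions and recomputing them.
    val graph = GraphLoader.edgeListFile(
      sc,
      path,
      edgeStorageLevel = StorageLevel.MEMORY_AND_DISK,
      vertexStorageLevel = StorageLevel.MEMORY_AND_DISK)

    // Each vertex ends up labelled with the smallest vertex id in its component.
    val cc = graph.connectedComponents()

    // Count distinct component labels to force the computation.
    val numComponents = cc.vertices.map(_._2).distinct().count()
    println(s"connected components: $numComponents")

    sc.stop()
  }
}
```

If memory is very tight, StorageLevel.DISK_ONLY could be passed in the same two places, at the cost of reading every partition back from disk on each iteration.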