On Jul 30, 2014, at 5:18 AM, Ankur Dave <ankurd...@gmail.com> wrote:

> Jeffrey Picard <jp3...@columbia.edu> writes:
>> As the program runs I’m seeing each iteration take longer and longer to 
>> complete. This seems counterintuitive to me, especially since I am seeing 
>> the shuffle read/write amounts decrease with each iteration. I would think 
>> that as more and more vertices converge, the iterations should take less 
>> time. I can run on up to 150 of the 500 part files (stored on S3) and it 
>> finishes in about 12 minutes, but with all the data I’ve let it run for up 
>> to 4 hours and it still doesn’t complete.
> 
> If GraphX is running close to the cluster's memory capacity, one possibility 
> is that Spark is dropping part of the graph from memory and causing 
> recomputation. The Spark web UI will show if this is the case: the Executors 
> page will show executors close to their memory limit, and the storage page 
> will show many RDDs with less than 100% cached blocks.
> 
> In that case you could allow Spark to spill partitions to disk by changing 
> the graph's storage level to MEMORY_AND_DISK or DISK_ONLY when you load the 
> graph.
> 
> Ankur

Thanks, Ankur. My problem does sound like what you described, so I think 
that’s probably it.

It seems that the version of GraphX I’m using doesn't have the option for 
setting the storage level in the GraphLoader.edgeListFile method:
https://spark.apache.org/docs/1.0.1/api/scala/index.html#org.apache.spark.graphx.GraphLoader$
I tried unpersisting the edges and vertices of the graph by hand, then 
persisting the graph with persist(StorageLevel.MEMORY_AND_DISK). However, I 
still see the same behavior in connected components, and the storage page 
shows the same thing you described.

Storage

RDD Name   Storage Level                      Cached      Fraction  Size in   Size in  Size on
                                              Partitions  Cached    Memory    Tachyon  Disk
VertexRDD  Memory Deserialized 1x Replicated  278         56%       50.6 GB   0.0 B    0.0 B
VertexRDD  Disk Serialized 1x Replicated      498         100%      0.0 B     0.0 B    32.4 GB
VertexRDD  Memory Deserialized 1x Replicated  435         87%       79.2 GB   0.0 B    0.0 B
EdgeRDD    Memory Deserialized 1x Replicated  492         98%       273.5 GB  0.0 B    0.0 B
VertexRDD  Memory Deserialized 1x Replicated  395         79%       71.5 GB   0.0 B    0.0 B
EdgeRDD    Memory Deserialized 1x Replicated  263         53%       146.2 GB  0.0 B    0.0 B
VertexRDD  Memory Deserialized 1x Replicated  400         80%       72.8 GB   0.0 B    0.0 B
VertexRDD  Memory Deserialized 1x Replicated  179         36%       32.4 GB   0.0 B    0.0 B
EdgeRDD    Disk Serialized 1x Replicated      500         100%      0.0 B     0.0 B    96.0 GB

Would a newer version of GraphX, where the storage level can be set in 
edgeListFile, possibly solve this, or could something else still be going 
on?
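
For concreteness, loading with an explicit storage level might look roughly 
like the sketch below. It assumes Spark 1.1 or later, where 
GraphLoader.edgeListFile is understood to accept edgeStorageLevel and 
vertexStorageLevel parameters (the 1.0.1 signature linked above does not). 
The application name and the S3 path are placeholders, not values from this 
thread, and whether this removes the slowdown is not established here.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.GraphLoader
import org.apache.spark.storage.StorageLevel

object ConnectedComponentsSpill {
  def main(args: Array[String]): Unit = {
    // Placeholder application name.
    val sc = new SparkContext(new SparkConf().setAppName("cc-spill"))

    // Hypothetical input location; substitute the real edge list path.
    val edgePath = "s3n://some-bucket/edges/part-*"

    // Assumption: Spark 1.1+. The storage levels are passed at load time so
    // the edge and vertex RDDs are created as MEMORY_AND_DISK and can spill
    // to disk rather than being dropped and recomputed.
    val graph = GraphLoader.edgeListFile(
      sc,
      edgePath,
      edgeStorageLevel = StorageLevel.MEMORY_AND_DISK,
      vertexStorageLevel = StorageLevel.MEMORY_AND_DISK)

    // Each vertex is labeled with the smallest vertex id in its component,
    // so counting distinct labels gives the number of components.
    val cc = graph.connectedComponents()
    println(cc.vertices.map(_._2).distinct().count())

    sc.stop()
  }
}
```

After a run, the storage page should show whether the VertexRDD and EdgeRDD 
entries now report a disk-backed storage level instead of "Memory 
Deserialized 1x Replicated".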
